Mapping the Chi Square Statistic

Background

The \(\chi^2\) statistic measures the dependency between the rows and columns in a table. WE experiment with a modified statistic to examine the spatial variation in the age-sex profiles of the Electoral Divisions (EDs) in Ireland.

First we load some libraries. The rgdal library provides functions for reading shapefiles, classInt has functions for determining the class intervals for any map using a variety of methods, and RColorBrewer contains some helpful colour palettes.

library(rgdal)
library(classInt)
library(RColorBrewer)
library(purrr)
library(glue)
library(sf)
library(tidyverse)
library(forcats)
library(ggspatial)

Next, boundary shapefiles are read for the EDs and administrative counties. WE also read the Small Area Population Statistics (SAPS) data for the EDs.

ED <- st_read("00_Irish_Boundary_Files",
              "Census2011_Electoral_Divisions_generalised20m",
              stringsAsFactors=FALSE)
CTY <- st_read("00_Irish_Boundary_Files",
              "Census2011_Admin_Counties_generalised20m",
              stringsAsFactors=FALSE)
SAPS <- read_csv(paste0("00_Irish_Census_2016/","AllThemesTablesED.csv")) %>% 
  rename(CSOED=GEOGID) %>% 
  mutate(CSOED = str_replace(CSOED,'E',''))

The order of the data differs between the spatial data and the Census data. The format of the codes in each dataset is also different, and where EDs have been combined the codes have been concatenated. In the ED file, the code is a 5 or 11 character string of the form 01001 for a single ED, and 01001/01002 for a merged one. In the SAPS data, these codes are preceded by the letter E (for ED), as in E01001 and E01001/01002.

Our contingency table will look at the age sex breakdown in each ED. We use 0-14, 15-24, 25-44, 45-64, and 65 and over, as the age categories. To accomplished this, column ranges in the SAPS file have to be combined, as below.

popData2 <- SAPS %>% select(CSOED)
popData2$M00_14 <- SAPS %>% select(matches('AGE(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)M')) %>% rowSums
popData2$M15_24 <- SAPS %>% select(matches('AGE(15|16|17|18|19|20_24)M')) %>% rowSums
popData2$M25_44 <- SAPS %>% select(matches('AGE(25_29|30_34|35_39|40_44)M')) %>% rowSums
popData2$M45_64 <- SAPS %>% select(matches('AGE(45_49|50_54|55_59|60_64)M')) %>% rowSums
popData2$M65old <- SAPS %>% select(matches('AGE(65_69|70_74|75_79|80_84|GE_85)M')) %>% rowSums

popData2$F00_14 <- SAPS %>% select(matches('AGE(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)F')) %>% rowSums
popData2$F15_24 <- SAPS %>% select(matches('AGE(15|16|17|18|19|20_24)F')) %>% rowSums
popData2$F25_44 <- SAPS %>% select(matches('AGE(25_29|30_34|35_39|40_44)F')) %>% rowSums
popData2$F45_64 <- SAPS %>% select(matches('AGE(45_49|50_54|55_59|60_64)F')) %>% rowSums
popData2$F65old <- SAPS %>% select(matches('AGE(65_69|70_74|75_79|80_84|GE_85)F')) %>% rowSums

The \(\chi^2\) formula is:

\[ \chi^2 = \sum_{j=1}^m\sum_{k=1}^{n} \frac{(O_{jk} - E_{jk})^2}{E_{jk}} \]

We can write a function:

chival <- function(O,std=TRUE) {
  if (std) O <- 100 * O / sum(O)
  E <- (rowSums(O) %o% colSums(O))/sum(O)
  sum((O-E)^2/E)
}

The std= parameter allows a value to be computed for a standard population of 100. The chi-square varies not only as the size of the table, but also the magnitude of the numbers in it.

Compute a value first for the whole of Ireland:

mat1 <- compose(list,lift_vd(partial(matrix,nrow=5,ncol=2)))

popData3 <- popData2 %>% group_by(CSOED) %>%
  summarise(popmat=mat1(M00_14,M15_24,M25_44,M45_64,M65old,
                     F00_14,F15_24,F25_44,F45_64,F65old)) %>%
  select(CSOED,popmat) %>% ungroup()

matname <- function(X,rnames,cnames) {
  rownames(X) <- rnames
  colnames(X) <- cnames
  return(X)
}

rn = c('<=14','15-24','25-44','45-64','>=65')
cn = c('M','F')

popData3 <- popData3 %>%
  mutate(popmat=map(popmat,matname,rnames=rn,cnames=cn))

print(popData3$popmat[1:3])

[[1]]
        M   F
<=14  256 260
15-24 383 367
25-44 705 591
45-64 459 422
>=65  327 411

[[2]]
        M   F
<=14  110  87
15-24  96  74
25-44 192 184
45-64 154 145
>=65  102 128

[[3]]
       M  F
<=14  56 70
15-24 40 26
25-44 75 79
45-64 79 74
>=65  30 31

IE_chisq <- chival(reduce(popData3$popmat,`+`))

cat(glue("The whole Ireland chi-square is {round(IE_chisq,3)}"))

The whole Ireland chi-square is 0.104

Then, compute the individual ED values:

N <- nrow(SAPS)
ED_chisq <- rep(0,N)
for(i in 1:N) ED_chisq[i] <-  chival(matrix(as.numeric(popData2[i,]),5,2))
popData2$lchisq <- ED_chisq

popData3 <- popData3 %>% mutate(lchisq=map_dbl(popmat,chival))

Now we can map the spatial distribution of the relative chi square values.

ED_chi <- ED %>% left_join(popData3)

Joining, by = "CSOED"

chimap <- ggplot(data=ED_chi,aes(fill=lchisq)) + geom_sf(col=NA) 
chimap + scale_fill_gradient(low='wheat',high='indianred')

There’s plenty of variation - it looks as if there’s more divergence in rural areas than urban areas. If we use a boxplot we can compare the distributions within each county.

tempdat <- ED_chi %>% select(CSOED,COUNTYNAME,lchisq) %>%
  mutate(COUNTYNAME=str_remove(COUNTYNAME,' *County *')) %>%
  mutate(COUNTYNAME=fct_reorder(COUNTYNAME,lchisq))
boxplots <- ggplot(data=tempdat,aes(x=COUNTYNAME,y=lchisq,group=COUNTYNAME)) +
  geom_boxplot() + coord_flip()
boxplots

These are very skewed - their might be some merit in logging this

boxplots + scale_y_log10()

And we can map the logs

chimap + scale_fill_gradient(low='wheat',high='indianred',trans='log10')

We’ll do this without the standardisation

IE_chisq_no_std <- chival(reduce(popData3$popmat,`+`),std=FALSE)

cat(glue("The whole Ireland chi-square (non standardised) is {round(IE_chisq_no_std,1)}"))

The whole Ireland chi-square (non standardised) is 4758.7

ED_chi<- ED_chi %>%
  mutate(lchisq_no_std=map_dbl(popmat,chival,std=FALSE))

tempdat <- ED_chi %>% select(CSOED,COUNTYNAME,lchisq_no_std) %>%
  mutate(COUNTYNAME=str_remove(COUNTYNAME,' *County *')) %>%
  mutate(COUNTYNAME=fct_reorder(COUNTYNAME,lchisq_no_std))
boxplots <- ggplot(data=tempdat,aes(x=COUNTYNAME,y=lchisq_no_std,group=COUNTYNAME)) +
  geom_boxplot() + coord_flip()
boxplots

boxplots + scale_y_log10()

These can also be mapped:

chimap <- ggplot(data=ED_chi,aes(fill=lchisq_no_std)) + geom_sf(col=NA) 
chimap + scale_fill_gradient(low='wheat',high='indianred')

Again, a log scale can be used:

chimap + scale_fill_gradient(low='wheat',high='indianred',trans='log10')

Now let’s try Cramer’s V… With \(\chi^2\) defined as above \(V_c = \sqrt{\frac{\chi^2/n}{\textrm{min}(r-1,c-1)}}\)

cramval <- function(O) {
  E <- (rowSums(O) %o% colSums(O))/sum(O)
  chisq_v <- sum((O-E)^2/E)
  return(sqrt(chisq_v/(sum(O)*(min(dim(O))-1))))
}

ED_chi<- ED_chi %>%
  mutate(lcramer=map_dbl(popmat,cramval))

tempdat <- ED_chi %>% select(CSOED,COUNTYNAME,lcramer) %>%
  mutate(COUNTYNAME=str_remove(COUNTYNAME,' *County *')) %>%
  mutate(COUNTYNAME=fct_reorder(COUNTYNAME,lcramer))
boxplots <- ggplot(data=tempdat,aes(x=COUNTYNAME,y=lcramer,group=COUNTYNAME)) +
  geom_boxplot() + coord_flip()
boxplots

It is also possible to use the bias corrected version of Cramer’s \(V\), denoted \(\tilde{V}\):

\[ \tilde{V} = \sqrt{\frac{\tilde{\psi}^2}{\textrm{min}(\tilde{k} - 1, \tilde{r} - 1)}} \]

where

\[ \tilde{\phi}^2 = \textrm{max}\left(0,\psi^2 - \frac{(k - 1)(r - 1)}{n - 1}\right) \]

and

\[ \psi^2 = \chi^2/n \]

\[ \tilde{k} = k - \frac{(k - 1)^2}{n - 1} \]

\[ \tilde{r} = r - \frac{(r - 1)^2}{n - 1} \]

cramval_bc <- function(O) {
  E <- (rowSums(O) %o% colSums(O))/sum(O)
  psisq_v <- max(0,sum((O-E)^2/E)/sum(O) - (nrow(O)-1)*(ncol(O) - 1)/(sum(O) - 1)) 
  kt <- nrow(O) - (nrow(O) - 1)^2/(sum(O) - 1)
  rt <- ncol(O) - (ncol(O) - 1)^2/(sum(O) - 1)
  return(sqrt(psisq_v/(min(kt,rt)-1)))
}

ED_chi<- ED_chi %>%
  mutate(lcramer_bc=map_dbl(popmat,cramval_bc))

tempdat <- ED_chi %>% select(CSOED,COUNTYNAME,lcramer_bc) %>%
  mutate(COUNTYNAME=str_remove(COUNTYNAME,' *County *')) %>%
  mutate(COUNTYNAME=fct_reorder(COUNTYNAME,lcramer_bc))
boxplots <- ggplot(data=tempdat,aes(x=COUNTYNAME,y=lcramer_bc,group=COUNTYNAME)) +
  geom_boxplot() + coord_flip()
boxplots

… and in map form:

chimap <- ggplot(data=ED_chi,aes(fill=lcramer_bc)) + geom_sf(col=NA) 
chimap + scale_fill_gradient(low='wheat',high='indianred')

Now, let’s think outside the box. The \(\chi^2\) measures the difference between an observed and an expected value. We often teach students that the expexted frequencies can be computed as the product of the corresponding row and column totals of the cell in question divided by the sum of the individuals in the table.

In actuality the expected value in a particular cell is the joint expectation of being in the corresponding row and column. The probabilities are computed under the hypothesis of independence. However, if the probabilities are those which correspond the the frequencies that would be expected in the local table reflected the national pattern, then the resulting \(\chi^2\) statistic would measure the deviance from the national pattern rather than independence.

We can also compute the Cramer’s V statistics to correct for the differening populations of the data zones. As V approaches 1, this provides stronger evidence that the local table differs in several respects from the national table.

chival_ext <- function(O,E,rescale=TRUE) {
  Ep <- E/sum(E)
  Ec <- sum(O) * Ep
  chisq <- sum((O-Ec)^2/Ec)
  if (rescale) {
    Oext <- (Ec == min(Ec)) * sum(O) # Naughty but works
    chisq_ext <- sum((Oext - Ec)^2/Ec)
    chisq <- chisq/chisq_ext
  }
  return(chisq)
}

nat_table <- reduce(ED_chi$popmat,`+`)

ED_chi<- ED_chi %>%
  mutate(lchisq_nat=map_dbl(popmat,chival_ext,E=nat_table))

tempdat <- ED_chi %>% select(CSOED,COUNTYNAME,lchisq_nat) %>%
  mutate(COUNTYNAME=str_remove(COUNTYNAME,' *County *')) %>%
  mutate(COUNTYNAME=fct_reorder(COUNTYNAME,lchisq_nat))
boxplots <- ggplot(data=tempdat,aes(x=COUNTYNAME,y=lchisq_nat,group=COUNTYNAME)) +
  geom_boxplot() + coord_flip()
boxplots

chimap <- ggplot(data=ED_chi,aes(fill=lchisq_nat)) + geom_sf(col=NA) 
chimap + scale_fill_gradient(low='wheat',high='indianred',trans=scales::boxcox_trans(p=1/4))

Next, aggregate over counties - then it is possible to compute aggrated statistics. The key is to aggregate the contingency tables at county level, then apply the individual stats as required.

CTY_chi <- ED_chi %>% 
  group_by(COUNTYNAME) %>% summarise(popmat=list(reduce(popmat,`+`))) %>%
  mutate(lchisq_nat=map_dbl(popmat,chival_ext,E=nat_table))

chimap <- ggplot(data=CTY_chi,aes(fill=lchisq_nat)) + 
  annotation_map_tile(type='osm',zoomin=0) + geom_sf(col=NA,alpha=0.7) 
chimap + scale_fill_gradient(low='wheat',high='indianred')

Loading required namespace: raster

Zoom: 8

Now lets create a grid-based coverage and compute local values on that:

hexes <- st_make_grid(CTY_chi,cellsize=20000,square = FALSE) %>% st_sf()
hexes <- hexes %>% transmute(hex_label = paste("Hexagon",sprintf('%03i',row_number())))
hexmap <- ggplot(data=hexes) + 
  annotation_map_tile(type='osm',zoomin=0) + geom_sf(alpha=0.7) 
hexmap

Zoom: 7

then aggregate to hexagons:

hex_agg <- st_join(ED_chi,hexes) %>% group_by(hex_label) %>% summarise(popmat=list(reduce(popmat,`+`))) %>% st_set_geometry(NULL)
hexes <- hexes %>% left_join(hex_agg)

Joining, by = "hex_label"

hexes <- hexes %>% 
  mutate(lchisq_nat=map_dbl(popmat,chival_ext,E=nat_table))

… and map the result:

chimap <- ggplot(data=hexes,aes(fill=lchisq_nat)) + 
  annotation_map_tile(type='osmgrayscale',zoomin=0) + geom_sf(col=NA,alpha=0.6) 
chimap + scale_fill_gradient(low='wheat',high='indianred')

Zoom: 7

These can be combined to a distance based diffusion (ie kernel) estimator. Firstly compute the distances.

dmat <- hexes %>% st_distance(ED_chi) %>% units::drop_units()
rownames(dmat) <- hexes$hex_label
colnames(dmat) <- ED_chi$CSOED

Next convert these in a sparse weights matrix. The format I use here is a three-column table with

Hexagon ID
CSO ED code
Weight

but only non zero weights are in the table. I’m using a bisquare kernel so there will be quite a few zero weights, none of which need to be stored or used in the calculation.

The aggregation is then just a weighted sum of contingency tables. The rescaled \(\chi^2\) statistic (comparing to national aggregate values) is then computed and joined back to the original hexagon layer.

d2w <- function(x,h) {
  res <- ifelse(x < h, (1 - (x/h)^2)^2, 0.0)
  return(res / sum(res))
}
wmat <- apply(dmat,2,partial(d2w,h=20000))

pull_pos <- function(hl,wmat) {
  vec <- wmat[hl,]
  ind <- colnames(wmat)[which(vec>0)]
  tibble(CSOED=ind,wt=wmat[hl,ind])
}

test2 <- hexes %>% group_by(hex_label) %>%
  do(w=pull_pos(.$hex_label,wmat)) %>% 
  unnest() %>% 
  group_by(hex_label)  %>% 
  left_join(ED_chi) %>% 
  mutate(popmatw=map2(popmat,wt,~{.x*.y})) %>% 
  summarise(popmatw = list(reduce(popmatw,`+`))) %>%
  transmute(lchisq_nat_w=map_dbl(popmatw,chival_ext,E=nat_table),
            hex_label)

Joining, by = "CSOED"

hexes <- hexes %>% left_join(test2)

Joining, by = "hex_label"

This is then mapped.

chimap <- ggplot(data=hexes,aes(fill=lchisq_nat_w)) + 
  annotation_map_tile(type='osmgrayscale',zoomin=0) + geom_sf(col=NA,alpha=0.6) 
chimap + scale_fill_gradient(low='wheat',high='indianred')

Zoom: 7

Mapping the Chi Square Statistic

Martin Charlton and Chris Brunsdon

27 November 2018

Background