1 Introduction

This tutorial shows how to fit a multilevel index of dissimilarity in the open source software, R, which can be downloaded for free from the Comprehensive R Acrhive Network. The Index of dissimilarity (ID) is one of the most widely used measures of segregation. It compares the geographical distribution of one group of people with the geographical distribution of another and asks whether the places where the first group is most commonly found are also the places where the second group is too.

Recently there has been interest in multilevel and multiscale methods of measuring segregation that allow the scales of segregation to be examined simultaneously, thereby considering the micro-, meso- and macro-level effects separately. The multilevel index of dissimilarity (MLID) takes forward this approach.

The case study, explored here, looks at the residential separation of the Indian and White British populations enumerated by the 2011 UK Census data for England and Wales. As well as outlining how to fit the multilevel index, it explores various ways of examining spatial and scale effects and their impacts upon a traditional ID score. Familiarity with R is assumed, although for those more interested in than the method than the syntax, the commands can simply be cut and pasted into R and run.

1.1 Copyright notice

This tutorial is copyright (c) Richard Harris, 2016 but may be reproduced for non-commercial purposes only. It may be cited as:

Harris R (2016) Fitting a multilevel index of dissimilarity: a tutorial in R. Available at http://rpubs.com/profrichharris

2 Preperation

Before beginning the tutorial you first need to:

Make sure the required libraries are available in R;
Download the required data; and
Load some additional functions available in a source file.

2.1 Installing the packages

The main library needed for the multilevel index is lme4, which is used for the model fitting. The package rio is used to import the data.

These packages need only to be installed once (or, if necessary, updated). The following code checks for the packages and installs them if they cannot be found.

# The following packages are required
packages <- c("rio","lme4")
# Check to see which packages are already available and install those that are not
installed <- installed.packages()[,1]
sapply(packages, function(x) if(!length(which(installed==x))) install.packages(x))

Should you need to update the packages, you can do so using the following code:

# Check to see which packages have newer versions and update those that do
old <- old.packages()[,1]
sapply(packages, function(x) if(length(which(old==x))) update.packages(x, ask=FALSE))

Alternatively, if you want to ‘manually’ (re-/)install the packages you can do so using

install.packages(packages)

2.2 Obtaining the data

The example case study is an analysis of 2011 Census data for England and Wales. The data give the number of each ethnic group in each Census small area neighbourhood, known as Output Areas (OAs). OAs are the smallest geographical unit for which Census data are released. UK Census data can be downloaded from InFuse. For ease of use, the required data have been uploaded to GitHub, from which they can be downloaded using the rio library:

require(rio)
census <- import("https://github.com/profrichharris/MLID/blob/master/ethnicity.csv.zip?raw=true")
head(census, n=3)

##          OA Persons WhiteBrit Irish Gypsy OtherWhite MixedWhBC MixedWhBA
## 1 E00000001     194       150     7     0         18         3         0
## 2 E00000003     250       177     2     1         26         0         1
## 3 E00000005     367       254    14     0         53         0         2
##   MixedWhAs MixedOther Indian Pakistani Bangladeshi Chinese OtherAsian
## 1         4          3      2         0           0       4          0
## 2         7          1     17         0           3       3          3
## 3         5          5      9         1           0      10          5
##   BlkAfrican BlkCaribbean BlackOther Arab Other
## 1          0            0          0    0     3
## 2          3            0          0    0     6
## 3          2            0          2    0     5

The Census geography is hierarchical - OAs group into what are called Lower Level Output Areas (LLOAs) at an aggregated scale. Those, in turn, group into Middle Level Output Areas (MLOAs), which group into Local Authority Districts (LADs), Government Office Regions (GORs) and countries. Lookup files indicating to which LLOA, MLOA, LAD and GOR each OA belongs are available from the Office for National Statistics. Again, for ease of use, the required data can be downloaded from GitHub:

lookup <- import("https://github.com/profrichharris/MLID/blob/master/lookup.csv.zip?raw=true")
head(lookup, n=3)

##          OA  LSOA11CD            LSOA11NM  MSOA11CD           MSOA11NM
## 1 E00000001 E01000001 City of London 001A E02000001 City of London 001
## 2 E00000003 E01000001 City of London 001A E02000001 City of London 001
## 3 E00000005 E01000001 City of London 001A E02000001 City of London 001
##     LAD11CD        LAD11NM   RGN11CD RGN11NM  CTRY11CD CTRY11NM
## 1 E09000001 City of London E12000007  London E92000001  England
## 2 E09000001 City of London E12000007  London E92000001  England
## 3 E09000001 City of London E12000007  London E92000001  England

The census data and the lookup file have a field in common, which is the names of the OAs. This will be used to join them together:

census <- merge(census, lookup, by="OA")
head(census, n=3)

##          OA Persons WhiteBrit Irish Gypsy OtherWhite MixedWhBC MixedWhBA
## 1 E00000001     194       150     7     0         18         3         0
## 2 E00000003     250       177     2     1         26         0         1
## 3 E00000005     367       254    14     0         53         0         2
##   MixedWhAs MixedOther Indian Pakistani Bangladeshi Chinese OtherAsian
## 1         4          3      2         0           0       4          0
## 2         7          1     17         0           3       3          3
## 3         5          5      9         1           0      10          5
##   BlkAfrican BlkCaribbean BlackOther Arab Other  LSOA11CD
## 1          0            0          0    0     3 E01000001
## 2          3            0          0    0     6 E01000001
## 3          2            0          2    0     5 E01000001
##              LSOA11NM  MSOA11CD           MSOA11NM   LAD11CD
## 1 City of London 001A E02000001 City of London 001 E09000001
## 2 City of London 001A E02000001 City of London 001 E09000001
## 3 City of London 001A E02000001 City of London 001 E09000001
##          LAD11NM   RGN11CD RGN11NM  CTRY11CD CTRY11NM
## 1 City of London E12000007  London E92000001  England
## 2 City of London E12000007  London E92000001  England
## 3 City of London E12000007  London E92000001  England

At this point we can do a bit of tidying up, leaving all but the census table

rm(list=ls()[-(which(ls()=="census"))])
ls()

## [1] "census"

2.3 Loading the source file

This includes some additional functions that will be used in the analysis.

source("https://raw.githubusercontent.com/profrichharris/MLID/master/MLID-functions.R")

3 The Index of Dissimilarity (ID)

As an illustrative example, the ID will be calculated to examine the residential separation of the Indian population from the White British population, using the 2011 Census data.

3.1 A conventional calculation

The Index of Dissimilarity (ID) may be expressed as

\[ {ID}_{XY}={0.5}\times\sum{\big|Y-X\big|} \]

where \(Y = \frac{y}{\sum{y}}\) and \(X = \frac{x}{\sum{x}}\). In the case study, \(y\) is the count of Indian residents per OA, and \(x\) is the count of White British. This means that Y is the share of the total Indian population that resides in each OA, and X is the share of the White British population. We can consider Y as an observed value and X as an expected value with which Y is compared. The logic is that if, say, 1 per cent of the entire White British population live in a given OA then 1 per cent of the Indian population ought to live there too; or, if 2 per cent of the White British population, then 2 per cent of Indians as well. But should it be found that the places with the higher shares of the Indian residents are also the places where the White British shares are lower then the difference suggests that the two groups tend to live apart.

Calculating the index in R,

attach(census)
Y <- Indian / sum(Indian)
X <- WhiteBrit / sum(WhiteBrit)
detach(census)
ID <- 0.5 * sum(abs(Y - X))
round(ID,3)

## [1] 0.7

which, when multiplied by 100, indicates that 70 percent of either the Indian or the White British population would have to move for the share of the total Indian population to be equal to the share of the White British population in each OA.

In principle the index ranges from 0 (‘no segregation’) to 1 (‘complete segregation’: where the White British reside, Indians do not, and vice versa) or from 0 to 100 on a percentage scale. A value of 0.7 (70 per cent) is high (although ethnic segregation is, in fact, falling in England and Wales) and arises because the Indian population, which makes up only 2.52 per cent of the Census population for England and Wales, is much more concentrated in cities than is the much larger White British population (at 80.49 per cent of the total Census population).

3.2 Using regression

The same calculation can be made by fitting a regression model with no intercept and a gradient fixed at one.

If \[ Y = \beta_{0} + \beta_{1}X + \epsilon \]

then, setting \(\beta_{0} = 0\) and \(\beta_{1} = 1\) gives,

\[ Y - X = \epsilon \]

\[ {ID}_{XY}={0.5}\times\sum{\big|Y-X\big|} \]

\[ {ID}_{XY}={0.5}\times\sum{\big|\epsilon\big|} \]

where \(\epsilon\) are the regression residuals. Calculating this in R,

ols <- lm(Y ~ 0, offset=X)
ID <- 0.5 * sum(abs(residuals(ols)))
round(ID,3)

## [1] 0.7

By definition, this results in the same index value. Although it may seem convoluted, placing the index in a regression framework is useful for exploring spatial and scale effects and forms the basis of the multilevel approach.

4 The index value under randomisation

Although, in principle, the index ranges from 0 to 1, when one group is much larger than the other (as is the case here) then it is numerically difficult, if not impossible, for their shares to be alike in each and every OA even if there is no real segregation. Furthermore, random errors in the census counts can affect the index value (due, perhaps, because one family was overlooked or because one person accidentally ticked the wrong box on the census form). The varying size and shape of the census OAs also affect the results.

The expected value for the ID, given the size of each group in the total population, and also given the geography of the study region (the total number of people living in each OA), can be estimated by simulation.

First, we calculate the prevalence of each group in the total population:

attach(census)
Py <- sum(Indian) / sum(Persons)
Px <- sum(WhiteBrit) / sum(Persons)
detach(census)
round(c(Py, Px), 3)

## [1] 0.025 0.805

Next we generate a random number of Indians per OA with probability of selection equal to their overall prevalence in the total population. We also do the same for the White British.

set.seed(16092016)    # This would normally be omitted
N <- nrow(census)     # This is the number of OAs in the study region
t <- census$Persons   # This is the total number of residents in each OA
y <- rbinom(N, t, Py)
x <- rbinom(N, t, Px)

We could now calculate the ID ‘by hand’ for these \(x\) and \(y\) values but instead we will use a short-cut function in R that is included in the source file loaded earlier,

# The function adopts R's formula syntax to indicate we are considering y with respect to (~) x
EID <- calcID(y ~ x, data=data.frame(x,y))
EID

## [1] 0.14

The procedure outlined above is similar to shuffling the individual pupils randomly around the OAs and calculating the ID that results after doing so. It is possible, however, that the simulated data (being random) are unusual. We will therefore repeat the process 100 times (using a function that, unlike above, also allows the total population size in each OA to vary a little too).

simulations <- IDsim(Indian ~ WhiteBrit + Persons, data = census, times=100)

Having generated the 100 replicates, we consider the spread and average of the results:

quantile(simulations, probs=c(0.005, 0.025, 0.5, 0.975, 0.995))

##  0.5%  2.5%   50% 97.5% 99.5% 
## 0.140 0.140 0.140 0.141 0.141

round(mean(simulations),4)

## [1] 0.1404

This indicates that the expected value under randomisation is about 0.1404 which is equal to 20.1 per cent of the actual value of 0.7. This is sizable but still means that the actual value is 4.99 times greater than what is expected under randomisation. Note that if you are running the practical yourself the results may differ slightly because of the randomisation.

5 Thinking hierarchically with the data

5.1 Calculating the influence of sub-regions

The index of dissimilarity is calculated as the sum of a series of localised differences - in the current case study, the difference in the share of the Indian and White British residents per Census OA. Denoting each location with the subscript \(i\) then

\[ {ID}_{XY}={0.5}\times\sum{\big|\epsilon_{i}\big|} \]

where

\[ \epsilon_{i} = Y_{i} - X_{i} \]

and \(X\) and \(Y\) are defined as previously.

Recall, that each OA groups into a Local Authority District (LAD) at a higher scale of aggregation (as well as into LLOAs, MLOAs, regions and countries). The contribution of each local authority (\(j\)) to the overall index can be calculated as

\[ \%share_{j} = {100}\times\frac{\sum_{i=1}^{N}w_{i}\big|\epsilon_{i}\big|}{\sum_{i=1}^{N}\big|\epsilon_{i}\big|} \]

where \(N\) is the total number of OAs and where \(w_{i} = 1\) if OA \(i\) belongs to local authority \(j\), else \(w_{i} = 0\).

A problem with this measure is that the number of OAs is not the same in each LAD. Those that contain more OAs are likely to contribute more to the overall index value. An alternative is to measure the relative impact of each region upon the overall index value as

\[ impact_{j} = {100}\times\frac{\sum_{i=1}^{N}w_{i}\big|\epsilon_{i}\big|}{n_{j}\overline{\big|\epsilon\big|}} \hspace{1cm}n_{j} = \sum_{i=1}^{N} w_{i} \]

where a score of 100 indicates the impact is as expected, 200 indicates it is double expectation, and 50 that it is half. The values can also be scaled to help identify ‘significant’ impacts,

\[ {z}\approx\frac{\sum_{i=1}^{N}w_{i}\big|\epsilon_{i}\big|}{n_{j}(se_{\epsilon})} \]

where \(se_{\epsilon}\) is the standard error of the residuals from the regression method of calculating the index.

Each of these values can be examined in R,

# Consider Indians with respect to (~) the White British and calculate by (|) LAD
LAD.values <- regional.values(Indian ~ WhiteBrit | LAD11NM, data=census)
# Sort the results by incrasing impact values:
LAD.values <- LAD.values[order(LAD.values[,2]),]
# Identify the LADs with greatest impact upon the index:
tail(LAD.values, n=3)

##             share impact     z n_j
## Brent     2.84924    623 2.812 829
## Leicester 4.53739    849 3.831 969
## Harrow    3.07134    868 3.914 642

# Determine the 95% cut-off point for a standard Normal curve (one-tailed test):
zmin <- qnorm(p=0.05, lower.tail=FALSE)
# Find the places above this threshold:
subset <- LAD.values[,3] > zmin
LAD.values[subset,]

##                         share impact     z n_j
## Hillingdon            1.64269    378 1.703 789
## Blackburn with Darwen 1.03159    427 1.927 438
## Ealing                2.28497    434 1.956 956
## Newham                2.07101    464 2.092 810
## Oadby and Wigston     0.45901    468 2.110 178
## Slough                1.03427    485 2.187 387
## Redbridge             2.16421    506 2.282 776
## Hounslow              2.29795    584 2.633 714
## Brent                 2.84924    623 2.812 829
## Leicester             4.53739    849 3.831 969
## Harrow                3.07134    868 3.914 642

The output shows that 11 local authorities make a ‘significant’ contribution to the overall ID value of 0.7, of which the most significant is Harrow, with an impact score 8.68 times greater than expected.

At the regional scale, none of the regions emerges as significant although London makes a bigger impact upon the index than any other region.

# Consider Indians with respect to (~) the White British and calculate by (|) Region
RGN.values <- regional.values(Indian ~ WhiteBrit | RGN11NM, data=census)
RGN.values <- RGN.values[order(RGN.values[,2]),]
subset <- RGN.values[,3] > zmin
nrow(RGN.values[subset,])

## [1] 0

RGN.values

##                             share impact     z   n_j
## North East                3.45787     71 0.321  8802
## East of England           7.49168     72 0.323 18995
## South West                7.09548     73 0.329 17644
## South East               11.42240     75 0.338 27638
## Wales                     4.12498     75 0.336 10036
## Yorkshire and The Humber  7.67116     81 0.363 17275
## North West               11.05192     86 0.387 23343
## West Midlands            12.36869    125 0.565 17916
## East Midlands            11.01214    136 0.613 14706
## London                   24.30368    176 0.794 25053

5.2 Fitting a multilevel model

In essence, what the measures above do is average the OA data over sub-parts of the study region and identify the parts where the average is higher or lower, therefore having greater or lesser impact upon the overall ID value. This is useful but doesn’t answer the question of at which scale residential segregation is occurring: do you get greater differences between places at the OA scale, the LLOA scale, the MLOA scale, the LAD scale, the regional scale, or…? To answer this we need to measure the OA scale effects net of (having allowed for) effects at other scales of analysis, the LLOA scale effects net of the same, and so on for all the other geographical levels included in the model.

To achieve this, the regression method of calculating the index is extended to a multilevel model:

If \[ Y = \beta_{0} + \beta_{1}X + \zeta_{i} + \eta_{j} + \theta_{k} ... \] where \[ \epsilon = \zeta_{i} + \eta_{j} + \theta_{k} ... \]

then, setting \(\beta_{0} = 0\) and \(\beta_{1} = 1\) gives,

\[ {ID}_{XY}={0.5}\times\sum{\big|\zeta_{i} + \eta_{j} + \theta_{k} ...\big|} \]

For example, consider a five level model that includes the OAs, LLOAs, MLOAs, LADs and regions. This may be fitted in R as follows.

# This may take a little while to run
# Here the mlid function is a wrapper for the lmer function in lme4
# It will convert the raw counts of Indians and White British per OA into shares of the same
mlm <- mlid(Indian ~ 0 + (1|LSOA11NM) + (1|MSOA11NM) + (1|LAD11NM) + (1|RGN11NM), offset="WhiteBrit", data=census)

The residuals from this model may be viewed and confirmed to sum up to what we know is the index value of 0.7:

resids <- rvals(mlm)
head(resids, n=3)

##            Base      LSOA11NM      MSOA11NM       LAD11NM      RGN11NM
## 1 -4.143859e-06 -1.187186e-07 -5.548247e-06 -3.561508e-06 1.146442e-05
## 2  5.873959e-06 -1.187186e-07 -5.548247e-06 -3.561508e-06 1.146442e-05
## 3 -1.493927e-06 -1.187186e-07 -5.548247e-06 -3.561508e-06 1.146442e-05

rs <- rowSums(resids)
ID <- 0.5 * sum(abs(rs))
round(ID, 3)

## [1] 0.7

Fitting the model is therefore straightforward. The critical question is how to interpret it?

5.3 Interpreting the multilevel model

A summary of the multilevel model is provided as follows,

summary(mlm)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Y ~ 0 + (1 | LSOA11NM) + (1 | MSOA11NM) + (1 | LAD11NM) + (1 |  
##     RGN11NM)
##    Data: data
##  Offset: X
## 
##      AIC      BIC   logLik deviance df.resid 
## -3710039 -3709989  1855025 -3710049   181403 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -21.141  -0.214  -0.036   0.152  39.289 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. 
##  LSOA11NM (Intercept) 3.377e-11 5.811e-06
##  MSOA11NM (Intercept) 1.085e-10 1.041e-05
##  LAD11NM  (Intercept) 6.962e-11 8.344e-06
##  RGN11NM  (Intercept) 2.442e-11 4.941e-06
##  Residual             5.221e-11 7.225e-06
## Number of obs: 181408, groups:  
## LSOA11NM, 34753; MSOA11NM, 7201; LAD11NM, 348; RGN11NM, 10

Of interest are the variance values at each level of the geographic hierarchy. These can be interpreted as a measure of spatial heterogeneity: the greater their value, the more the differences between the Indian and White British populations vary at the given scale. For the current model, the greatest variance is at the MSOA11NM scale, which contributes 37.6 of the total, as shown below.

sort(varshare(mlm), decreasing=T)

## MSOA11NM  LAD11NM     Base LSOA11NM  RGN11NM 
##    37.60    24.14    18.10    11.71     8.46

Less variance does not necessarily equate to less segregation because the model is based on the difference between Y and X, the shares of the Indian and White British populations. It is possible for Y to be much greater than X at a given scale - suggesting greater segregation - but consistently so, with little variation. Equally, it is possible for Y to be not much greater than X but for the differences to vary from place to place at that scale. In short, the variances need to be considered alongside the mean effect at each level:

sort(apply(abs(resids), 2, mean), decreasing=T)

##      LAD11NM      RGN11NM     MSOA11NM         Base     LSOA11NM 
## 4.872292e-06 3.767247e-06 3.640699e-06 2.882748e-06 1.637093e-06

or, equivalently,

sort(means(mlm), decreasing=T)

Together, the results suggest that segregation effects are strongest at the LAD11NM scale but more variable at the MSOA11NM scale.

There is a third way of exploring the results which is to look at what happens to the ID if the residual values (the values of \(\zeta_{i}\), \(\eta_{j}\), \(\theta_{k}\), etc.) are set to zero, which is equivalent to holding them back from the calculation. What is the change in the index value that results? Considered as a percentage change in the original ID value, the answers are given by,

sort(holdback(mlm))

##  RGN11NM  LAD11NM     Base MSOA11NM LSOA11NM 
##    -14.7     -7.5     -7.2     -6.0     -2.3

which show that the greatest impact occurs at the RGN11NM scale. Although this may appear to contradict the previous finding, it does not. Although the segregation effect may be smaller, on average, at the regional scale than at the local authority scale, the region is a larger geographical entity and spreads more widely.

Finally, we can explore visually which places at any given scale a greater share of the Indian population than the White British population or vice versa. These can be shown on a caterpillar plot, which is better suited to the levels where there are fewer places to display - for example, at the regional scale. To produce it, the information required for the confidence interval is calculated,

var <- condVar(mlm)

and then plotted with a 95 per cent interval using either the method outlined by Goldstein and Healy,

catplot(mlm, var, level=5, method="goldstein", labels=T)

or, more quickly (and not noticeably different in this case) as,

catplot(mlm, var, level=5, method="quick", labels=T)

In either case, it can be seen that London stands apart from the other Government Office Regions. Any that is above the zero, dotted line has an excess share of the Indian population relative to the White British population, at the regional level. In fact, the percentage of the regional population that is Indian is much higher for London than for any other region, at 6.64 per cent:

with(census, round(sort(100 * tapply(Indian, RGN11NM, sum) / tapply(Persons, RGN11NM, sum), decreasing=T),2))

##                   London            West Midlands            East Midlands 
##                     6.64                     3.90                     3.73 
##               South East               North West          East of England 
##                     1.76                     1.52                     1.48 
## Yorkshire and The Humber               South West               North East 
##                     1.31                     0.65                     0.61 
##                    Wales 
##                     0.56

6 Extensions

One of the reasons for the the Index of Dissimilarity’s popularity is that it compares the share of group Y with the share of group X in each small level area, regardless of how many of groups Y and X there are in the total population. Imagine that between two time periods the number of group Y doubled in every area whilst (for some reason) the number of group X declined by one quarter. In most circumstances, the index value for the two periods would be exactly the same because, relative to their total number, the geographical distribution of the two groups is unchanged. In this regard, the index is invariant to the group size. This is not true, however, if X and Y add up to the total population, which would be the case if, for example, Y was a count of benefit claimants and X was a count of non-claimants. In this circumstance, the Gorard index may be preferred,

\[ {GI}_{XY}={0.5}\times\sum{\big|Y-T\big|} \]

where \(Y = \frac{y}{\sum{y}}\) and \(T = \frac{x+y}{\sum{x+y}}\). All that is required is to change the offset in the models above.

Further extensions would be consider ‘overspill’ effects from neighbouring locations and to look for evidence of spatial clustering across the boundaries defined by the multilevel hierarchy. This is in the spirit of integrating spatial econometric approaches with multilevel ones, for which the HSAR: Hierarchical Spatial Autoregressive Model has been developed. There is also potential to apply Geographically Weighted methods, available in the GWmodel package.

Finally, it should be noted that the Index of Dissimilarity (and Gorard Index) are only one of a swathe of techniques available to measure segregation. A set of tools for measuring spatial segregation is available in the seg package.