Spatial microsimulation in R

Introduction

This is an R Markdown document to demonstrate how iterative proportional fitting (IPF) can be used to perform spatial microsimulation: sampling from an individual-level dataset to fill areas about which only aggregated values are known. This procedure provides the context for a paper entitled “‘Truncate, replicate, sample’: a method for creating integer weights for spatial microsimulation”.

Input data

Spatial microsimulation requires two input datasets: individual-level data, where rows represent individuals, and geographically aggregated data, where rows represent areas. The follow code creates example datasets, based on a survey of 10 individuals and 5 small areas. The spatial microsimulation model will select individuals based on age, sex and mode of transport. For consistency with the (larger) model used for the paper, we will refer to the individual-level data as USd (short for Understanding Society dataset) and the geographic data as all.msim (all constraint variables).

Individual-level data

# Read in the data in long form (normaly read.table() used)
c.names <- c("id", "age", "sex", "mode")
USd <- c(1, 27, "m", "car.d", 2, 54, "m", "car.d", 3, 35, "f", "bus", 4, 42, 
    "m", "walk", 5, 50, "f", "car.p", 6, 19, "m", "car.d", 7, 62, "f", "car.d", 
    8, 21, "f", "bicycle", 9, 38, "f", "walk", 10, 48, "f", "car.d")
USd <- matrix(USd, nrow = 10, byrow = T)  # Convert long data into matrix, by row
USd <- data.frame(USd)  # Convert this into a dataframe
names(USd) <- c.names  # Add correct column names
USd$age <- as.numeric(levels(USd$age)[USd$age])  # Age is a numeric variable

USd  # Show the data frame in R

##    id age sex    mode
## 1   1  27   m   car.d
## 2   2  54   m   car.d
## 3   3  35   f     bus
## 4   4  42   m    walk
## 5   5  50   f   car.p
## 6   6  19   m   car.d
## 7   7  62   f   car.d
## 8   8  21   f bicycle
## 9   9  38   f    walk
## 10 10  48   f   car.d

Geographical data

# Read in the data in long form (normaly read.table() used)
category.labels <- c("16-30", "31-50", "50+" # Age constraint 
             ,"m", "f" # Sex constraint
             ,"bicycle", "bus", "car.d", "car.p", "walk" # Mode constraint
             )
all.msim <- c(  3, 3, 4,   5, 5,   0.001, 1, 8, 1, 0.001, # Car dominated
                2, 2, 6,   4, 6,   0.001, 3, 5, 1, 1, # Elderly
                3, 4, 4,   3, 8,   1, 2, 5, 2, 1, # Female dominated
                3, 3, 3,   7, 2,   2, 1, 3, 1, 2, # Male dominated
                7, 2, 1,   6, 4,   7, 0.001, 2, 0.001, 1  # Many cyclists, young
                )
all.msim <- matrix(all.msim, nrow = 5, byrow = T) # Convert long data into matrix, by row
all.msim <- data.frame(all.msim) # Convert this into a dataframe
names(all.msim) <- c.names # Add correct column names
all.msim # Show the data frame in R

##   id age sex mode NA    NA    NA NA    NA    NA
## 1  3   3   4    5  5 0.001 1.000  8 1.000 0.001
## 2  2   2   6    4  6 0.001 3.000  5 1.000 1.000
## 3  3   4   4    3  8 1.000 2.000  5 2.000 1.000
## 4  3   3   3    7  2 2.000 1.000  3 1.000 2.000
## 5  7   2   1    6  4 7.000 0.001  2 0.001 1.000


# Check totals for each constraint match
rowSums(all.msim[,1:3]) # Age constraint

## [1] 10 10 11  9 10

rowSums(all.msim[,4:5]) # Sex constraint

## [1] 10 10 11  9 10

rowSums(all.msim[,6:10])# Mode constraint

## [1] 10 10 11  9 10

Reweighting the survey dataset

Iterative proportional fitting will determine the weight allocated to each individual for each zone to best match the geographically aggregated data. A weight matrix is therefore created, with rows corresponding to individuals and columns to zones.

Create weights: one set of weights for each constraint and one for starting

weights0 <- array(dim = c(nrow(USd), nrow(all.msim)))
weights1 <- array(dim = c(nrow(USd), nrow(all.msim)))
weights2 <- array(dim = c(nrow(USd), nrow(all.msim)))
weights3 <- array(dim = c(nrow(USd), nrow(all.msim)))

weights0[, ] <- 1  # sets initial weights to 1

Create survey aggregate arrays (for direct comparison with the geographical aggregate data)

USd.agg <- array(dim = c(nrow(all.msim), ncol(all.msim)))
USd.agg1 <- array(dim = c(nrow(all.msim), ncol(all.msim)))
USd.agg2 <- array(dim = c(nrow(all.msim), ncol(all.msim)))
USd.agg3 <- array(dim = c(nrow(all.msim), ncol(all.msim)))
colnames(USd.agg1) <- category.labels

Convert survey data into wide form

This step allows the individual-level data to be compared with the aggregated data directly

USd.cat <- array(rep(0), dim = c(nrow(USd), length(category.labels != 0)))

USd.cat[which(USd$age <= 30), 1] <- 1
USd.cat[which(USd$age >= 31 & USd$age <= 50), 2] <- 1
USd.cat[which(USd$age > 50), 3] <- 1
USd.cat[which(USd$sex == "m"), 4] <- 1
USd.cat[which(USd$sex == "f"), 5] <- 1
USd.cat[which(USd$mode == "bicycle"), 6] <- 1  # Mode constraints
USd.cat[which(USd$mode == "bus"), 7] <- 1
USd.cat[which(USd$mode == "car.d"), 8] <- 1
USd.cat[which(USd$mode == "car.p"), 9] <- 1
USd.cat[which(USd$mode == "walk"), 10] <- 1
sum(USd.cat)  # Should be 30: 3 variables by 10 individuals

## [1] 30


for (i in 1:nrow(all.msim)) {
    # Loop creating aggregate values (to be repeated later)
    USd.agg[i, ] <- colSums(USd.cat * weights0[, i])
}

# Test results
USd.agg

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    3    5    2    4    6    1    1    5    1     2
## [2,]    3    5    2    4    6    1    1    5    1     2
## [3,]    3    5    2    4    6    1    1    5    1     2
## [4,]    3    5    2    4    6    1    1    5    1     2
## [5,]    3    5    2    4    6    1    1    5    1     2

all.msim

##   id age sex mode NA    NA    NA NA    NA    NA
## 1  3   3   4    5  5 0.001 1.000  8 1.000 0.001
## 2  2   2   6    4  6 0.001 3.000  5 1.000 1.000
## 3  3   4   4    3  8 1.000 2.000  5 2.000 1.000
## 4  3   3   3    7  2 2.000 1.000  3 1.000 2.000
## 5  7   2   1    6  4 7.000 0.001  2 0.001 1.000

plot(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg)), xlab = "Constraints", 
    ylab = "Model output")
abline(a = 0, b = 1)

plot of chunk unnamed-chunk-5

cor(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg)))

## [1] 0.546

Note that for USd.agg, the results are the same for every zone, as each individual has a weight of 1 for every zone.
The next stage is to apply the first constraint, to adjust the weights of each individual so they match the age constraints.

Constraint 1

for (j in 1:nrow(all.msim)) {
    weights1[which(USd$age <= 30), j] <- all.msim[j, 1]/USd.agg[j, 1]
    weights1[which(USd$age >= 31 & USd$age <= 50), j] <- all.msim[j, 2]/USd.agg[j, 
        2]
    weights1[which(USd$age >= 51), j] <- all.msim[j, 3]/USd.agg[j, 3]  ##
}
# Aggregate the results for each zone
for (i in 1:nrow(all.msim)) {
    USd.agg1[i, ] <- colSums(USd.cat * weights0[, i] * weights1[, i])
}
# Test results
USd.agg1

##      16-30 31-50 50+     m     f bicycle bus car.d car.p walk
## [1,]     3     3   4 4.600 5.400  1.0000 0.6 6.600   0.6  1.2
## [2,]     2     2   6 4.733 5.267  0.6667 0.4 7.733   0.4  0.8
## [3,]     3     4   4 4.800 6.200  1.0000 0.8 6.800   0.8  1.6
## [4,]     3     3   3 4.100 4.900  1.0000 0.6 5.600   0.6  1.2
## [5,]     7     2   1 5.567 4.433  2.3333 0.4 6.067   0.4  0.8

all.msim

##   id age sex mode NA    NA    NA NA    NA    NA
## 1  3   3   4    5  5 0.001 1.000  8 1.000 0.001
## 2  2   2   6    4  6 0.001 3.000  5 1.000 1.000
## 3  3   4   4    3  8 1.000 2.000  5 2.000 1.000
## 4  3   3   3    7  2 2.000 1.000  3 1.000 2.000
## 5  7   2   1    6  4 7.000 0.001  2 0.001 1.000

plot(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg1)), xlab = "Constraints", 
    ylab = "Model output")
abline(a = 0, b = 1)

plot of chunk unnamed-chunk-6

cor(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg1)))

## [1] 0.7944

As indicated by the plots and the correlation values, the fit between the individual-level data and the aggregate constraints (the inpute data) has been vastly improved, just by constraining by a single variable. We will perform the test after each constraint to ensure our model is improving. To see how the weights change for each individual for each area, type weights1 for constraint 1.

weights1

##       [,1]   [,2] [,3] [,4]  [,5]
##  [1,]  1.0 0.6667  1.0  1.0 2.333
##  [2,]  2.0 3.0000  2.0  1.5 0.500
##  [3,]  0.6 0.4000  0.8  0.6 0.400
##  [4,]  0.6 0.4000  0.8  0.6 0.400
##  [5,]  0.6 0.4000  0.8  0.6 0.400
##  [6,]  1.0 0.6667  1.0  1.0 2.333
##  [7,]  2.0 3.0000  2.0  1.5 0.500
##  [8,]  1.0 0.6667  1.0  1.0 2.333
##  [9,]  0.6 0.4000  0.8  0.6 0.400
## [10,]  0.6 0.4000  0.8  0.6 0.400

Constraint 2: sex

for (j in 1:nrow(all.msim)) {
    weights2[which(USd$sex == "m"), j] <- all.msim[j, 4]/USd.agg1[j, 4]
    weights2[which(USd$sex == "f"), j] <- all.msim[j, 5]/USd.agg1[j, 5]
}

for (i in 1:nrow(all.msim)) {
    USd.agg2[i, ] <- colSums(USd.cat * weights0[, i] * weights1[, i] * weights2[, 
        i])
}
# Test results
USd.agg2

##       [,1]  [,2]  [,3] [,4] [,5]   [,6]   [,7]  [,8]   [,9]  [,10]
## [1,] 3.100 2.874 4.026    5    5 0.9259 0.5556 6.755 0.5556 1.2077
## [2,] 1.886 2.161 5.953    4    6 0.7595 0.4557 7.535 0.4557 0.7937
## [3,] 2.540 4.629 3.831    3    8 1.2903 1.0323 6.113 1.0323 1.5323
## [4,] 3.823 2.004 3.173    7    2 0.4082 0.2449 6.833 0.2449 1.2693
## [5,] 7.135 1.875 0.990    6    4 2.1053 0.3609 6.381 0.3609 0.7920

all.msim

##   id age sex mode NA    NA    NA NA    NA    NA
## 1  3   3   4    5  5 0.001 1.000  8 1.000 0.001
## 2  2   2   6    4  6 0.001 3.000  5 1.000 1.000
## 3  3   4   4    3  8 1.000 2.000  5 2.000 1.000
## 4  3   3   3    7  2 2.000 1.000  3 1.000 2.000
## 5  7   2   1    6  4 7.000 0.001  2 0.001 1.000

plot(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg2)), xlab = "Constraints", 
    ylab = "Model output")
abline(a = 0, b = 1)

plot of chunk unnamed-chunk-8

cor(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg2)))

## [1] 0.8362

Again the correlation has improved. Now onto the 3rd constraint

Constraint 3: mode

for (j in 1:nrow(all.msim)) {
    weights3[which(USd$mode == "bicycle"), j] <- all.msim[j, 6]/USd.agg2[j, 
        6]
    weights3[which(USd$mode == "bus"), j] <- all.msim[j, 7]/USd.agg2[j, 7]
    weights3[which(USd$mode == "car.d"), j] <- all.msim[j, 8]/USd.agg2[j, 8]
    weights3[which(USd$mode == "car.p"), j] <- all.msim[j, 9]/USd.agg2[j, 9]
    weights3[which(USd$mode == "walk"), j] <- all.msim[j, 10]/USd.agg2[j, 10]
}
weights4 <- weights0 * weights1 * weights2 * weights3
for (i in 1:nrow(all.msim)) {
    USd.agg3[i, ] <- colSums(USd.cat * weights4[, i])
}
# Test results
USd.agg3

##        [,1]  [,2]   [,3]  [,4]  [,5]  [,6]  [,7] [,8]  [,9] [,10]
## [1,] 2.5755 2.659 4.7676 5.150 4.852 0.001 1.000    8 1.000 0.001
## [2,] 0.7486 5.302 3.9500 2.856 7.145 0.001 3.000    5 1.000 1.000
## [3,] 2.0224 5.844 3.1332 2.371 8.629 1.000 2.000    5 2.000 1.000
## [4,] 3.4992 4.108 1.3932 4.238 4.762 2.000 1.000    3 1.000 2.000
## [5,] 8.5766 1.115 0.3103 2.290 7.712 7.000 0.001    2 0.001 1.000

all.msim

##   id age sex mode NA    NA    NA NA    NA    NA
## 1  3   3   4    5  5 0.001 1.000  8 1.000 0.001
## 2  2   2   6    4  6 0.001 3.000  5 1.000 1.000
## 3  3   4   4    3  8 1.000 2.000  5 2.000 1.000
## 4  3   3   3    7  2 2.000 1.000  3 1.000 2.000
## 5  7   2   1    6  4 7.000 0.001  2 0.001 1.000

plot(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg3)), xlab = "Constraints", 
    ylab = "Model output")
abline(a = 0, b = 1)

plot of chunk unnamed-chunk-9

cor(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg3)))

## [1] 0.8588

rowSums(USd.agg3[, 1:3])  # The total population modelled for each zone, constraint 1

## [1] 10 10 11  9 10

rowSums(USd.agg3[, 4:5])

## [1] 10 10 11  9 10

rowSums(USd.agg3[, 6:10])

## [1] 10 10 11  9 10

Iterations

The correlation has been improved from one iteration to the next.
Even the after the final mode constraint, which differs greatly from the survey data for some variables, the correlation has improved. This illustrates the robustness of the IPF method. Also note that the population of each simulated zone is correct. The next step is to perform further iterations, using the results of the first iteration (weights4) as our starting point.

weights0 <- weights4

After running this command, simply run the model again, beginning from the loop at the end of the R code section titled “Convert survey data into wide form”. After running constraint 3 the second time, the correlation is higher: 0.89 instead of 0.86. Because this is a relatively simple example, the fit between constraint and simulated aggregate variables will not improve much beyond this point.

Discussion

We have run-through the IPF procedure in R. This simplified example should illustrate how the programming language is well-suited to the task, with a number of in-built functions to manipulate and analyse the data. The final correlation after…