This is an R Markdown document to demonstrate how iterative proportional fitting (IPF) can be used to perform spatial microsimulation: sampling from an individual-level dataset to fill areas about which only aggregated values are known. This procedure provides the context for a paper entitled “‘Truncate, replicate, sample’: a method for creating integer weights for spatial microsimulation”.
Spatial microsimulation requires two input datasets: individual-level data, where rows represent individuals, and geographically aggregated data, where rows represent areas. The follow code creates example datasets, based on a survey of 10 individuals and 5 small areas. The spatial microsimulation model will select individuals based on age, sex and mode of transport. For consistency with the (larger) model used for the paper, we will refer to the individual-level data as USd (short for Understanding Society dataset) and the geographic data as all.msim (all constraint variables).
# Read in the data in long form (normaly read.table() used)
c.names <- c("id", "age", "sex", "mode")
USd <- c(1, 27, "m", "car.d", 2, 54, "m", "car.d", 3, 35, "f", "bus", 4, 42,
"m", "walk", 5, 50, "f", "car.p", 6, 19, "m", "car.d", 7, 62, "f", "car.d",
8, 21, "f", "bicycle", 9, 38, "f", "walk", 10, 48, "f", "car.d")
USd <- matrix(USd, nrow = 10, byrow = T) # Convert long data into matrix, by row
USd <- data.frame(USd) # Convert this into a dataframe
names(USd) <- c.names # Add correct column names
USd$age <- as.numeric(levels(USd$age)[USd$age]) # Age is a numeric variable
USd # Show the data frame in R
## id age sex mode
## 1 1 27 m car.d
## 2 2 54 m car.d
## 3 3 35 f bus
## 4 4 42 m walk
## 5 5 50 f car.p
## 6 6 19 m car.d
## 7 7 62 f car.d
## 8 8 21 f bicycle
## 9 9 38 f walk
## 10 10 48 f car.d
# Read in the data in long form (normaly read.table() used)
category.labels <- c("16-30", "31-50", "50+" # Age constraint
,"m", "f" # Sex constraint
,"bicycle", "bus", "car.d", "car.p", "walk" # Mode constraint
)
all.msim <- c( 3, 3, 4, 5, 5, 0.001, 1, 8, 1, 0.001, # Car dominated
2, 2, 6, 4, 6, 0.001, 3, 5, 1, 1, # Elderly
3, 4, 4, 3, 8, 1, 2, 5, 2, 1, # Female dominated
3, 3, 3, 7, 2, 2, 1, 3, 1, 2, # Male dominated
7, 2, 1, 6, 4, 7, 0.001, 2, 0.001, 1 # Many cyclists, young
)
all.msim <- matrix(all.msim, nrow = 5, byrow = T) # Convert long data into matrix, by row
all.msim <- data.frame(all.msim) # Convert this into a dataframe
names(all.msim) <- c.names # Add correct column names
all.msim # Show the data frame in R
## id age sex mode NA NA NA NA NA NA
## 1 3 3 4 5 5 0.001 1.000 8 1.000 0.001
## 2 2 2 6 4 6 0.001 3.000 5 1.000 1.000
## 3 3 4 4 3 8 1.000 2.000 5 2.000 1.000
## 4 3 3 3 7 2 2.000 1.000 3 1.000 2.000
## 5 7 2 1 6 4 7.000 0.001 2 0.001 1.000
# Check totals for each constraint match
rowSums(all.msim[,1:3]) # Age constraint
## [1] 10 10 11 9 10
rowSums(all.msim[,4:5]) # Sex constraint
## [1] 10 10 11 9 10
rowSums(all.msim[,6:10])# Mode constraint
## [1] 10 10 11 9 10
Iterative proportional fitting will determine the weight allocated to each individual for each zone to best match the geographically aggregated data. A weight matrix is therefore created, with rows corresponding to individuals and columns to zones.
weights0 <- array(dim = c(nrow(USd), nrow(all.msim)))
weights1 <- array(dim = c(nrow(USd), nrow(all.msim)))
weights2 <- array(dim = c(nrow(USd), nrow(all.msim)))
weights3 <- array(dim = c(nrow(USd), nrow(all.msim)))
weights0[, ] <- 1 # sets initial weights to 1
USd.agg <- array(dim = c(nrow(all.msim), ncol(all.msim)))
USd.agg1 <- array(dim = c(nrow(all.msim), ncol(all.msim)))
USd.agg2 <- array(dim = c(nrow(all.msim), ncol(all.msim)))
USd.agg3 <- array(dim = c(nrow(all.msim), ncol(all.msim)))
colnames(USd.agg1) <- category.labels
This step allows the individual-level data to be compared with the aggregated data directly
USd.cat <- array(rep(0), dim = c(nrow(USd), length(category.labels != 0)))
USd.cat[which(USd$age <= 30), 1] <- 1
USd.cat[which(USd$age >= 31 & USd$age <= 50), 2] <- 1
USd.cat[which(USd$age > 50), 3] <- 1
USd.cat[which(USd$sex == "m"), 4] <- 1
USd.cat[which(USd$sex == "f"), 5] <- 1
USd.cat[which(USd$mode == "bicycle"), 6] <- 1 # Mode constraints
USd.cat[which(USd$mode == "bus"), 7] <- 1
USd.cat[which(USd$mode == "car.d"), 8] <- 1
USd.cat[which(USd$mode == "car.p"), 9] <- 1
USd.cat[which(USd$mode == "walk"), 10] <- 1
sum(USd.cat) # Should be 30: 3 variables by 10 individuals
## [1] 30
for (i in 1:nrow(all.msim)) {
# Loop creating aggregate values (to be repeated later)
USd.agg[i, ] <- colSums(USd.cat * weights0[, i])
}
# Test results
USd.agg
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3 5 2 4 6 1 1 5 1 2
## [2,] 3 5 2 4 6 1 1 5 1 2
## [3,] 3 5 2 4 6 1 1 5 1 2
## [4,] 3 5 2 4 6 1 1 5 1 2
## [5,] 3 5 2 4 6 1 1 5 1 2
all.msim
## id age sex mode NA NA NA NA NA NA
## 1 3 3 4 5 5 0.001 1.000 8 1.000 0.001
## 2 2 2 6 4 6 0.001 3.000 5 1.000 1.000
## 3 3 4 4 3 8 1.000 2.000 5 2.000 1.000
## 4 3 3 3 7 2 2.000 1.000 3 1.000 2.000
## 5 7 2 1 6 4 7.000 0.001 2 0.001 1.000
plot(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg)), xlab = "Constraints",
ylab = "Model output")
abline(a = 0, b = 1)
cor(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg)))
## [1] 0.546
Note that for USd.agg, the results are the same for every zone, as each individual has a weight of 1 for every zone.
The next stage is to apply the first constraint, to adjust the weights of each individual so they match the age constraints.
for (j in 1:nrow(all.msim)) {
weights1[which(USd$age <= 30), j] <- all.msim[j, 1]/USd.agg[j, 1]
weights1[which(USd$age >= 31 & USd$age <= 50), j] <- all.msim[j, 2]/USd.agg[j,
2]
weights1[which(USd$age >= 51), j] <- all.msim[j, 3]/USd.agg[j, 3] ##
}
# Aggregate the results for each zone
for (i in 1:nrow(all.msim)) {
USd.agg1[i, ] <- colSums(USd.cat * weights0[, i] * weights1[, i])
}
# Test results
USd.agg1
## 16-30 31-50 50+ m f bicycle bus car.d car.p walk
## [1,] 3 3 4 4.600 5.400 1.0000 0.6 6.600 0.6 1.2
## [2,] 2 2 6 4.733 5.267 0.6667 0.4 7.733 0.4 0.8
## [3,] 3 4 4 4.800 6.200 1.0000 0.8 6.800 0.8 1.6
## [4,] 3 3 3 4.100 4.900 1.0000 0.6 5.600 0.6 1.2
## [5,] 7 2 1 5.567 4.433 2.3333 0.4 6.067 0.4 0.8
all.msim
## id age sex mode NA NA NA NA NA NA
## 1 3 3 4 5 5 0.001 1.000 8 1.000 0.001
## 2 2 2 6 4 6 0.001 3.000 5 1.000 1.000
## 3 3 4 4 3 8 1.000 2.000 5 2.000 1.000
## 4 3 3 3 7 2 2.000 1.000 3 1.000 2.000
## 5 7 2 1 6 4 7.000 0.001 2 0.001 1.000
plot(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg1)), xlab = "Constraints",
ylab = "Model output")
abline(a = 0, b = 1)
cor(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg1)))
## [1] 0.7944
As indicated by the plots and the correlation values, the fit between the individual-level data and the aggregate constraints (the inpute data) has been vastly improved, just by constraining by a single variable. We will perform the test after each constraint to ensure our model is improving. To see how the weights change for each individual for each area, type weights1 for constraint 1.
weights1
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0 0.6667 1.0 1.0 2.333
## [2,] 2.0 3.0000 2.0 1.5 0.500
## [3,] 0.6 0.4000 0.8 0.6 0.400
## [4,] 0.6 0.4000 0.8 0.6 0.400
## [5,] 0.6 0.4000 0.8 0.6 0.400
## [6,] 1.0 0.6667 1.0 1.0 2.333
## [7,] 2.0 3.0000 2.0 1.5 0.500
## [8,] 1.0 0.6667 1.0 1.0 2.333
## [9,] 0.6 0.4000 0.8 0.6 0.400
## [10,] 0.6 0.4000 0.8 0.6 0.400
for (j in 1:nrow(all.msim)) {
weights2[which(USd$sex == "m"), j] <- all.msim[j, 4]/USd.agg1[j, 4]
weights2[which(USd$sex == "f"), j] <- all.msim[j, 5]/USd.agg1[j, 5]
}
for (i in 1:nrow(all.msim)) {
USd.agg2[i, ] <- colSums(USd.cat * weights0[, i] * weights1[, i] * weights2[,
i])
}
# Test results
USd.agg2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3.100 2.874 4.026 5 5 0.9259 0.5556 6.755 0.5556 1.2077
## [2,] 1.886 2.161 5.953 4 6 0.7595 0.4557 7.535 0.4557 0.7937
## [3,] 2.540 4.629 3.831 3 8 1.2903 1.0323 6.113 1.0323 1.5323
## [4,] 3.823 2.004 3.173 7 2 0.4082 0.2449 6.833 0.2449 1.2693
## [5,] 7.135 1.875 0.990 6 4 2.1053 0.3609 6.381 0.3609 0.7920
all.msim
## id age sex mode NA NA NA NA NA NA
## 1 3 3 4 5 5 0.001 1.000 8 1.000 0.001
## 2 2 2 6 4 6 0.001 3.000 5 1.000 1.000
## 3 3 4 4 3 8 1.000 2.000 5 2.000 1.000
## 4 3 3 3 7 2 2.000 1.000 3 1.000 2.000
## 5 7 2 1 6 4 7.000 0.001 2 0.001 1.000
plot(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg2)), xlab = "Constraints",
ylab = "Model output")
abline(a = 0, b = 1)
cor(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg2)))
## [1] 0.8362
Again the correlation has improved. Now onto the 3rd constraint
for (j in 1:nrow(all.msim)) {
weights3[which(USd$mode == "bicycle"), j] <- all.msim[j, 6]/USd.agg2[j,
6]
weights3[which(USd$mode == "bus"), j] <- all.msim[j, 7]/USd.agg2[j, 7]
weights3[which(USd$mode == "car.d"), j] <- all.msim[j, 8]/USd.agg2[j, 8]
weights3[which(USd$mode == "car.p"), j] <- all.msim[j, 9]/USd.agg2[j, 9]
weights3[which(USd$mode == "walk"), j] <- all.msim[j, 10]/USd.agg2[j, 10]
}
weights4 <- weights0 * weights1 * weights2 * weights3
for (i in 1:nrow(all.msim)) {
USd.agg3[i, ] <- colSums(USd.cat * weights4[, i])
}
# Test results
USd.agg3
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2.5755 2.659 4.7676 5.150 4.852 0.001 1.000 8 1.000 0.001
## [2,] 0.7486 5.302 3.9500 2.856 7.145 0.001 3.000 5 1.000 1.000
## [3,] 2.0224 5.844 3.1332 2.371 8.629 1.000 2.000 5 2.000 1.000
## [4,] 3.4992 4.108 1.3932 4.238 4.762 2.000 1.000 3 1.000 2.000
## [5,] 8.5766 1.115 0.3103 2.290 7.712 7.000 0.001 2 0.001 1.000
all.msim
## id age sex mode NA NA NA NA NA NA
## 1 3 3 4 5 5 0.001 1.000 8 1.000 0.001
## 2 2 2 6 4 6 0.001 3.000 5 1.000 1.000
## 3 3 4 4 3 8 1.000 2.000 5 2.000 1.000
## 4 3 3 3 7 2 2.000 1.000 3 1.000 2.000
## 5 7 2 1 6 4 7.000 0.001 2 0.001 1.000
plot(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg3)), xlab = "Constraints",
ylab = "Model output")
abline(a = 0, b = 1)
cor(as.vector(as.matrix(all.msim)), as.vector(as.matrix(USd.agg3)))
## [1] 0.8588
rowSums(USd.agg3[, 1:3]) # The total population modelled for each zone, constraint 1
## [1] 10 10 11 9 10
rowSums(USd.agg3[, 4:5])
## [1] 10 10 11 9 10
rowSums(USd.agg3[, 6:10])
## [1] 10 10 11 9 10
The correlation has been improved from one iteration to the next.
Even the after the final mode constraint, which differs greatly from the survey data for some variables, the correlation has improved. This illustrates the robustness of the IPF method. Also note that the population of each simulated zone is correct. The next step is to perform further iterations, using the results of the first iteration (weights4) as our starting point.
weights0 <- weights4
After running this command, simply run the model again, beginning from the loop at the end of the R code section titled “Convert survey data into wide form”. After running constraint 3 the second time, the correlation is higher: 0.89 instead of 0.86. Because this is a relatively simple example, the fit between constraint and simulated aggregate variables will not improve much beyond this point.
We have run-through the IPF procedure in R. This simplified example should illustrate how the programming language is well-suited to the task, with a number of in-built functions to manipulate and analyse the data. The final correlation after…