A common difficulty with spatial data is that they are often “aggregated”, that is, generalized from their original spatial scale (i.e. high detail) into a smaller spatial scale (i.e. low detail). The best example is a demographic census which is surveyed at household level but reported at several levels of aggregation such as blocks, sectors, municipalities, provinces and departments (O’Sullivan & Unwin, 2010).
The problem is that the aggregation units are arbitrary with respect to the phenomena under investigation, yet the units used will affect statistics determined on the basis of data reported in this way. This difficulty is referred to as the modifiable areal unit problem (MAUP). If the spatial units in a particular study were specified differently, users would observe very different patterns and relationships (O’Sullivan & Unwin, 2010).
Let’s illustrate MAUP using the R language and enviroment for statistical computing and graphics. First, let’s create two raster variables, x (i.e. the independent variable) and y (i.e. the dependent variable). You can create and display these rasters using this code:
if(!require(raster)){
install.packages("raster")
library(raster)
}
## Loading required package: raster
## Loading required package: sp
x <- c(87,95,72,37,44,24,40,55,55,38,88,34,41,30,26,35,38,24,14,56,37,34,8,18,49,44,51,67,17,37,55,25,33,32,59,54)
y <- c(72,75,85,29,58,30,50,60,49,46,84,23,21,46,22,42,45,14,19,36,48,23,8,29,38,47,52,52,22,48,58,40,46,38,35,55)
mx <- matrix(ncol=6, nrow=6, x)
my <- matrix(ncol=6, nrow=6, y)
rx <- raster(mx)
ry <- raster(my)
par(mfrow=c(1,2))
plot(rx,main="Independent raster (x)")
text(rx)
plot(ry,main="Dependent raster (y)")
text(ry)
A linear regression model can be used for such a purpose, for example:
model_1 <- lm(ry[] ~ rx[])
summary(model_1)
##
## Call:
## lm(formula = ry[] ~ rx[])
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.303 -8.536 3.293 7.466 20.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.37497 4.12773 2.513 0.0169 *
## rx[] 0.75435 0.08668 8.703 3.61e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 34 degrees of freedom
## Multiple R-squared: 0.6902, Adjusted R-squared: 0.6811
## F-statistic: 75.74 on 1 and 34 DF, p-value: 3.605e-10
plot(model_1)
Let’s aggregate the raster by a 2-factor only in vertical direction (i.e. the resulting raster will have six colummns and three rows):
rx1 <- aggregate(rx, fact=c(1,2), fun=mean, expand=T)
ry1 <- aggregate(ry, fact=c(1,2), fun=mean, expand=T)
par(mfrow=c(1,2))
plot(rx1,main="Independent raster (x1)")
text(rx1)
plot(ry1,main="Dependent raster (y1)")
text(ry1)
Now, let’s create a new linear regression model:
model_2 <- lm(ry1[] ~ rx1[])
summary(model_2)
##
## Call:
## lm(formula = ry1[] ~ rx1[])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.884 -3.621 -1.827 5.501 9.119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.58992 3.74759 3.626 0.00227 **
## rx1[] 0.67982 0.08096 8.398 2.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.768 on 16 degrees of freedom
## Multiple R-squared: 0.8151, Adjusted R-squared: 0.8035
## F-statistic: 70.52 on 1 and 16 DF, p-value: 2.94e-07
Let’s aggregate the raster by a 2-factor only in horizontal direction (i.e. the resulting raster will have three colummns and six rows):
rx2 <- aggregate(rx, fact=c(2,1), fun=mean, expand=T)
ry2 <- aggregate(ry, fact=c(2,1), fun=mean, expand=T)
par(mfrow=c(1,2))
plot(rx2,main="Independent raster (x2)")
text(rx2)
plot(ry2,main="Dependent raster (y2)")
text(ry2)
Now, let’s create another linear regression model:
model_3 <- lm(ry2[] ~ rx2[])
summary(model_3)
##
## Call:
## lm(formula = ry2[] ~ rx2[])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4540 -3.2961 -0.8083 4.1461 8.9260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.25704 3.88449 0.324 0.75
## rx2[] 0.96571 0.08491 11.373 4.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.487 on 16 degrees of freedom
## Multiple R-squared: 0.8899, Adjusted R-squared: 0.883
## F-statistic: 129.4 on 1 and 16 DF, p-value: 4.456e-09
Usually, as shown here, regression relationships are strenghtened by aggregation. In fact, using a simulation approach, Openshaw and Taylor (1979) showed that with the same underlying data, it is possible to aggregate units together in ways that produce correlations anywhere between -1.0 and +1.0 (O’Sullivan & Unwin, 2010).
The practical implications of MAUP are immense for almost all decision-making processes involving GIS technology, since with the availability of aggregated maps, policy could easily focus on issues and problems which might look different if the aggregation scheme used were changed (O’Sullivan & Unwin, 2010).
If you want to see more go to http://gispopsci.org/maup/ and have a look
O’Sullivan, D. and Unwin, D.J., 2010, Geographic Information Analysis Second Edition. John Wiley & Sons. 405 pp.