There are two common uses for wanting the centroid of a polygon:
The second scenario can be visually very unappealing and difficult to read when labels are placed at the centre of the bounding box of the polygon, rather than the geometric centre (a.k.a. centre of mass or centre of gravity). Bounding boxes are simpler to understand and compute, but often perform unsatisfactorily, especially for polygons are very elongated and/or exhibit concavity. The geometric centre method is more complex to compute, but provides a point which more closely matches human intuition of "the centre" and generally performs better in both scenarios from a practical perspective.
The purpose of this document is to demonstrate how to compute the centroids of polygons using both methods, and illustrate the advantages of the geometric centre method over the bounding box method.
If you find any mistakes, have any suggestions for amendments or additions, or have any questions about the content, please contact me at earl.w.duncan@gmail.com.
The R code in this document requires the following packages to be installed and loaded:
library(sp) # For extent(); bbox();
library(rgeos) # For readOGR(); gIntersection(); get.bbox()
library(rgdal) # For readOGR()
library(magrittr) # For pipe operator %>%
library(ggplot2) # For fortify(); ggplot()
For illustrative purposes, I use a shapefile of small areas (known as statistical areas level 2 (SA2s)) in Australia provided by the Australian Bureau of Statistics (ABS). The shapefile is downloaded as a compressed zip file (45.4MB) to a temporary folder and extracted before being read into R.
# Download the zip file
URL <- "https://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&1270055001_sa2_2011_aust_shape.zip&1270.0.55.001&Data%20Cubes&7130A5514535C5FCCA257801000D3FBD&0&July%202011&23.12.2010&Latest.zip"
temp <- tempfile()
download.file(URL, temp)
# Rename the downloaded zip file
new.filename <- paste0(temp, ".zip")
file.rename(temp, new.filename)
temp <- new.filename
# Extract from the contents from the zip file
unzip(temp, exdir = tempdir())
map <- paste0(tempdir(), "/SA2_2011_AUST.shp") %>%
readOGR()
# Remove the temporary files and objects in memory
unlink(temp) # Delete the zip file from temp folder
rm(URL, new.filename, temp)
The object map
is a SpatialPolygonsDataFrame
which contains 2196 features (SA2s).
Initially, we will focus on a single SA2 which illustrates the differences between the two methods. The SA2 of interest has the variable SA2_5DIG11
equal to 11528. We will refer to this object as map.1
.
# Exclude all areas except the SA2 of interest
map.1 <- map[which(map$SA2_5DIG11 == 11528),]
Here is a map of the area of interest:
# Plot
map.df <- fortify(map.1)
ggplot(data = map.df, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "grey60", colour = "black") +
coord_map() +
theme_bw()
Later, we will compare the two methods by analysing a region of SA2s. This region is defined by the following code. We will refer to this object as map.2
.
map.2 <- map[grepl(SA4.names[46], map$SA4_NAME11),]
Here is a map of this region:
# Plot
map.df <- fortify(map.2)
ggplot(data = map.df, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = id), colour = "black") +
coord_map() +
theme_bw() +
scale_fill_discrete(guide = FALSE)
This type of centroid is fairly simple to compute:
\[\text{Centroid}_{BB} = \left(x_{\text{min}} + HW, y_{\text{min}} + HH\right)\]
Note that this is mathematically equivalent to:
\[\text{Centroid}_{BB} = \left(0.5(x_{\text{min}} + x_{\text{max}}), 0.5(y_{\text{min}} + y_{\text{max}})\right)\] This type of centroid can be computed using the bbox
function from the sp
package as demonstrated by the following code. This is written as a user-defined function (UDF) as we will be re-using this later. Note that the required input is a SpatialPolygonsDataFrame
.
get.centroid.bb <- function(x){
N <- length(x) # Number of polygons
# Initialise data.frame
Centroids.bb <- data.frame(matrix(NA, N, 2, dimnames = list(NULL, c("long", "lat"))))
for(i in 1:N){
# Bounding box of polygon
bb <- bbox(x@polygons[[i]])
# Compute centroid
Centroids.bb[i,] <- c(
0.5 * (bb[1,1] + bb[1,2]),
0.5 * (bb[2,1] + bb[2,2]))
}
return(Centroids.bb)
}
For example:
get.centroid.bb(map.1)
## long lat
## 1 151.1778 -34.04658
Note that the get.bbox()
function from the rgeos
package is similar, but it only works for spatial objects with the class gpc.poly
.
This type of centroid is defined as follows:
\[\text{Centroid}_{GC} = (C_x, C_y)\]
where
\[ C_x = \frac{1}{6A}\sum_{i=1}^{n-1}\left(x_i + x_{i+1}\right)\left(x_i y_{i+1} - x_{i+1}y_i\right) \\ C_y = \frac{1}{6A}\sum_{i=1}^{n-1}\left(y_i + y_{i+1}\right)\left(x_i y_{i+1} - x_{i+1}y_i\right) \]
and \(A\) is the area of the polygon, and \(n\) is the number of coordinates (vertices) comprising the polygon (inlcuding the duplicate of the first coordinate to close the ring).
Note that the formulae above assume that the vertices of the polygon are ordered in a anti-clockwise direction. The signs of the vertices need to be flipped if the polygon is ordered in a clockwise direction. (For ESRI shapefiles, polygon vertices are connected in a clockwise direction, unless the polygons represent "holes" in which case they are connected in an anti-clockwise direction).
To compute this type of centroid, I define a user-defined function (UDF) that implements the above equations to obtain the geometric centre of a single polygon. The area and coordinates (vertices) are easily obtained from the spatial data using the s4 extractor function @
(conveniently, the last coordinate is a duplicate of the first).
get.gc <- function(polygon){
A <- polygon@area # Area
C <- polygon@coords # Coords of each vertix (long, lat)
C1 <- C[-nrow(C),] # x_i, y_i
C2 <- C[-1,] # x_{i+1}, y_{i+1}
# Compute geometric centre
GC <- c(
sum((C1[,1] + C2[,1]) * (C1[,1]*C2[,2] - C2[,1]*C1[,2])) / (6*A),
sum((C1[,2] + C2[,2]) * (C1[,1]*C2[,2] - C2[,1]*C1[,2])) / (6*A))
# If the polygon is oriented clockwise, switch the signs
if(polygon@ringDir == 1) GC <- -GC
return(GC)
}
In practice, the polygon (area) we want to compute the centroid of may actually comprise multiple polygons. In this case, the area is called a Polygons
object, which is a list of Polygon
objects, one element for each polygon that it comprises. For example, our map.1
area consists of three polygons. We can see this by viewing the structure of this object, in particular, the polygons
slot.
str(map.1@polygons)
## List of 1
## $ :Formal class 'Polygons' [package "sp"] with 5 slots
## .. ..@ Polygons :List of 3
## .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
## .. .. .. .. ..@ labpt : num [1:2] 151 -34
## .. .. .. .. ..@ area : num 0.00263
## .. .. .. .. ..@ hole : logi FALSE
## .. .. .. .. ..@ ringDir: int 1
## .. .. .. .. ..@ coords : num [1:7075, 1:2] 151 151 151 151 151 ...
## .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
## .. .. .. .. ..@ labpt : num [1:2] 151.2 -34.1
## .. .. .. .. ..@ area : num 0.000115
## .. .. .. .. ..@ hole : logi FALSE
## .. .. .. .. ..@ ringDir: int 1
## .. .. .. .. ..@ coords : num [1:452, 1:2] 151 151 151 151 151 ...
## .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
## .. .. .. .. ..@ labpt : num [1:2] 151.1 -34.1
## .. .. .. .. ..@ area : num 2.86e-05
## .. .. .. .. ..@ hole : logi FALSE
## .. .. .. .. ..@ ringDir: int 1
## .. .. .. .. ..@ coords : num [1:394, 1:2] 151 151 151 151 151 ...
## .. ..@ plotOrder: int [1:3] 1 2 3
## .. ..@ labpt : num [1:2] 151 -34
## .. ..@ ID : chr "0"
## .. ..@ area : num 0.00277
There doesn't seem to be one obvious way to deal with the situation of having multiple polygons in a Polygons
object. Two approaches are:
Polygons
object; orTo average the centroids while retaining the geometric centre, it would be necessary to weight the centroids according the area they occupy, i.e. taking a weighted average. However, this approach may not be the best solution. If the goal is to add labels to a map, then placing the label on the largest polygon (option 2) is probably preferable over placing it at the average geometric centre, which may appear awkward. If the goal is to compute distances between centroids, then it is debatable which option is better, and this preference may vary between different Polygons
objects.
Another difficulty occurs when a Polygons
object contains "holes", which are polygons that represent areas of "zero mass". Dealing with holes is more straightforward -- we need to compute the geometric centre of any holes, and find the weighted average of these holes with the non-hole polygon(s), taking into account that holes will move the geometric centre in the opposite direction than had they not been holes.
Therefore, to compute the centroid of a Polygons
object using the geometric centre, I define the following UDF which allows the user to specify which method is to be used (either average
(default) or largest
). If there are multiple Polygon
elements, then the geometric centre is computed for each one, including holes, and these are averaged as discussed above (if method = average
, otherwise holes are ignored -- I may change this in a future update).
get.centroid.gc <- function(x, method = "average"){
N <- length(x) # Number of polygons
# Initialise data.frame
Centroids.cg <- data.frame(matrix(NA, N, 2, dimnames = list(NULL, c("long", "lat"))))
for(i in 1:N){
Polys <- x@polygons[[i]]
n <- length(Polys@Polygons)
if(n == 1){
Centroids.cg[i,] <- get.gc(Polys@Polygons[[1]])
}else{
A <- sapply(Polys@Polygons, function(x) x@area) # Area of each polygon
D <- sapply(Polys@Polygons, function(x) x@ringDir) # Ring direction
if(tolower(method) %in% c("l", "largest")){
L <- which.max(A[which(D == 1)]) # Largest area of non-hole polygon
Centroids.cg[i,] <- get.gc(Polys@Polygons[[L]])
}else if(tolower(method) %in% c("a", "av", "average", "w", "weighted")){
GC.part <- matrix(NA, n, 2)
for(j in 1:n){
Poly <- Polys@Polygons[[j]]
GC.part[j,] <- get.gc(Poly)
}
# Weighted average distance between geometric centres
A <- A * D
A <- A / sum(A)
Centroids.cg[i,] <- A %*% GC.part
}
}
}
return(Centroids.cg)
}
For example:
get.centroid.gc(map.1)
## long lat
## 1 151.1798 -34.03417
If you are using polygons obtained from a shapefile as in the example above, then as part of the spatial information contained within the shapefile, there are some default centroids. These centroids are used as the default point location when plotting labels, and thus the slot is appropriately named labpt
. These centroids may be equivalent to one of the two methods above, or may be some other arbitrary point - it is up to the discretion of the creator of the shapefile.
The following code extracts these labpt
values:
get.centroid.shp <- function(x){
N <- length(x) # Number of polygons
# Initialise data.frame
Centroids.shp <- data.frame(matrix(NA, N, 2, dimnames = list(NULL, c("long", "lat"))))
for(i in 1:N){
Centroids.shp[i,] <- x@polygons[[i]]@labpt
}
return(Centroids.shp)
}
For example:
get.centroid.shp(map.1)
## long lat
## 1 151.1816 -34.03142
For plotting purposes, it would be nice to see the bounding box as well as the centroids. Thus I first define another UDF which acquires the bounding box for each Polygons
object. To avoid confusion with the function from the rgeos
packages, I call this UDF get.bbox2
.
get.bbox2 <- function(x){
N <- length(x) # Number of polygons
# Initialise data.frame
bb <- data.frame(matrix(NA, N, 4, dimnames = list(NULL, c("xmin", "ymin", "xmax", "ymax"))))
for(i in 1:N){
# Bounding box of polygon
bb[i,] <- c(bbox(x@polygons[[i]]))
}
return(bb)
}
Here is a visual comparison of the methods for map.1
.
# Get bounding boxes and centroids
bboxes <- get.bbox2(map.1)
dat <- rbind(
get.centroid.bb(map.1),
get.centroid.gc(map.1, "average"),
get.centroid.gc(map.1, "largest"),
get.centroid.shp(map.1)
)
# Add plot ID to Centroids.bb and bboxes for facet_wrap
map.df <- fortify(map.1)
N <- length(map.1)
dat$Method <- rep(c(
"Bounding box",
"Geometric centre (average)",
"Geometric centre (largest)",
"Shapefile default"),
each = N
)
# Plot
ggplot(data = map.df) +
geom_polygon(aes(x = long, y = lat, group = group), fill = "grey60", colour = "black") +
coord_map() +
theme_bw() +
# theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
geom_point(data = dat, aes(x = long, y = lat, colour = Method, shape = Method), size = 2) +
scale_colour_manual(values = c("red", "blue", "cyan", "black")) +
geom_rect(data = bboxes, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
colour = "red", fill = NA)
Here is a visual comparison of the methods for map.2
.
# Get centroids
dat <- rbind(
get.centroid.bb(map.2),
get.centroid.gc(map.2, "average"),
get.centroid.gc(map.2, "largest"),
get.centroid.shp(map.2)
)
# Add plot ID to Centroids.bb and bboxes for facet_wrap
map.df <- fortify(map.2)
dat$id <- unique(map.df$id)
N <- length(map.2)
dat$Method <- rep(c(
"Bounding box",
"Geometric centre (average)",
"Geometric centre (largest)",
"Shapefile default"),
each = N
)
# Plot
ggplot(data = map.df) +
geom_polygon(aes(x = long, y = lat, group = group, fill = id), colour = "black") +
coord_map() +
theme_bw() +
# theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
# facet_wrap(~ id) +
geom_point(data = dat, aes(x = long, y = lat, colour = Method, shape = Method), size = 2) +
scale_colour_manual(values = c("red", "blue", "cyan", "black")) +
scale_fill_discrete(guide = FALSE)