\( \newcommand{\bm}[1]{\boldsymbol{#1}} \)

Preface

Aim

There are two common uses for wanting the centroid of a polygon:

  1. To compute the distance between pairs of polygons; and
  2. To plot labels, such as county/suburb names, in the "middle" of the polygon when creating a map.

The second scenario can be visually very unappealing and difficult to read when labels are placed at the centre of the bounding box of the polygon, rather than the geometric centre (a.k.a. centre of mass or centre of gravity). Bounding boxes are simpler to understand and compute, but often perform unsatisfactorily, especially for polygons are very elongated and/or exhibit concavity. The geometric centre method is more complex to compute, but provides a point which more closely matches human intuition of "the centre" and generally performs better in both scenarios from a practical perspective.

The purpose of this document is to demonstrate how to compute the centroids of polygons using both methods, and illustrate the advantages of the geometric centre method over the bounding box method.

If you find any mistakes, have any suggestions for amendments or additions, or have any questions about the content, please contact me at earl.w.duncan@gmail.com.

R Libraries

The R code in this document requires the following packages to be installed and loaded:

library(sp)             # For extent(); bbox();
library(rgeos)          # For readOGR(); gIntersection(); get.bbox()
library(rgdal)          # For readOGR()
library(magrittr)       # For pipe operator %>%
library(ggplot2)        # For fortify(); ggplot()

1 Introduction

1.1 Importing a Spatial Object (Map)

For illustrative purposes, I use a shapefile of small areas (known as statistical areas level 2 (SA2s)) in Australia provided by the Australian Bureau of Statistics (ABS). The shapefile is downloaded as a compressed zip file (45.4MB) to a temporary folder and extracted before being read into R.

# Download the zip file
URL <- "https://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&1270055001_sa2_2011_aust_shape.zip&1270.0.55.001&Data%20Cubes&7130A5514535C5FCCA257801000D3FBD&0&July%202011&23.12.2010&Latest.zip"
temp <- tempfile()
download.file(URL, temp)

# Rename the downloaded zip file
new.filename <- paste0(temp, ".zip")
file.rename(temp, new.filename)
temp <- new.filename

# Extract from the contents from the zip file
unzip(temp, exdir = tempdir())
map <- paste0(tempdir(), "/SA2_2011_AUST.shp") %>%
    readOGR()

# Remove the temporary files and objects in memory
unlink(temp) # Delete the zip file from temp folder
rm(URL, new.filename, temp)

The object map is a SpatialPolygonsDataFrame which contains 2196 features (SA2s).

1.2 Isolating Areas of Interest

Initially, we will focus on a single SA2 which illustrates the differences between the two methods. The SA2 of interest has the variable SA2_5DIG11 equal to 11528. We will refer to this object as map.1.

# Exclude all areas except the SA2 of interest
map.1 <- map[which(map$SA2_5DIG11 == 11528),]

Here is a map of the area of interest:

# Plot
map.df <- fortify(map.1)
ggplot(data = map.df, aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = "grey60", colour = "black") +
    coord_map() +
    theme_bw()

Later, we will compare the two methods by analysing a region of SA2s. This region is defined by the following code. We will refer to this object as map.2.

map.2 <- map[grepl(SA4.names[46], map$SA4_NAME11),]

Here is a map of this region:

# Plot
map.df <- fortify(map.2)
ggplot(data = map.df, aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = id), colour = "black") +
    coord_map() +
    theme_bw() +
    scale_fill_discrete(guide = FALSE)

2 Computing Centroids

2.1 The Bounding Box Method

This type of centroid is fairly simple to compute:

  1. Determine the bounding box of the polygon. In typical Cartesian fashion, let the lower-left coordinate defining this box be labelled \(x_{\text{min}}\) and \(y_{\text{min}}\), and the upper-right coordinate be defined be labelled \(x_{\text{max}}\) and \(y_{\text{max}}\).
  2. Compute the half-width \(HW = 0.5(x_{\text{max}} - x_{\text{min}})\) and half-height \(HH = 0.5(y_{\text{max}} - y_{\text{min}})\).
  3. Add the half-width and half-height to the lower-left coordinate:

\[\text{Centroid}_{BB} = \left(x_{\text{min}} + HW, y_{\text{min}} + HH\right)\]

Note that this is mathematically equivalent to:

\[\text{Centroid}_{BB} = \left(0.5(x_{\text{min}} + x_{\text{max}}), 0.5(y_{\text{min}} + y_{\text{max}})\right)\] This type of centroid can be computed using the bbox function from the sp package as demonstrated by the following code. This is written as a user-defined function (UDF) as we will be re-using this later. Note that the required input is a SpatialPolygonsDataFrame.

get.centroid.bb <- function(x){
    N <- length(x)  # Number of polygons
    # Initialise data.frame
    Centroids.bb <- data.frame(matrix(NA, N, 2, dimnames = list(NULL, c("long", "lat"))))
    for(i in 1:N){
        # Bounding box of polygon
        bb <- bbox(x@polygons[[i]])
        # Compute centroid
        Centroids.bb[i,] <- c(
            0.5 * (bb[1,1] + bb[1,2]),
            0.5 * (bb[2,1] + bb[2,2]))
    }
    return(Centroids.bb)
}

For example:

get.centroid.bb(map.1)
##       long       lat
## 1 151.1778 -34.04658

Note that the get.bbox() function from the rgeos package is similar, but it only works for spatial objects with the class gpc.poly.

2.2 The Geometric Centre Method

This type of centroid is defined as follows:

\[\text{Centroid}_{GC} = (C_x, C_y)\]

where

\[ C_x = \frac{1}{6A}\sum_{i=1}^{n-1}\left(x_i + x_{i+1}\right)\left(x_i y_{i+1} - x_{i+1}y_i\right) \\ C_y = \frac{1}{6A}\sum_{i=1}^{n-1}\left(y_i + y_{i+1}\right)\left(x_i y_{i+1} - x_{i+1}y_i\right) \]

and \(A\) is the area of the polygon, and \(n\) is the number of coordinates (vertices) comprising the polygon (inlcuding the duplicate of the first coordinate to close the ring).

Note that the formulae above assume that the vertices of the polygon are ordered in a anti-clockwise direction. The signs of the vertices need to be flipped if the polygon is ordered in a clockwise direction. (For ESRI shapefiles, polygon vertices are connected in a clockwise direction, unless the polygons represent "holes" in which case they are connected in an anti-clockwise direction).

To compute this type of centroid, I define a user-defined function (UDF) that implements the above equations to obtain the geometric centre of a single polygon. The area and coordinates (vertices) are easily obtained from the spatial data using the s4 extractor function @ (conveniently, the last coordinate is a duplicate of the first).

get.gc <- function(polygon){
    A <- polygon@area   # Area
    C <- polygon@coords # Coords of each vertix (long, lat)
    C1 <- C[-nrow(C),]  # x_i, y_i
    C2 <- C[-1,]        # x_{i+1}, y_{i+1}
    # Compute geometric centre
    GC <- c(
        sum((C1[,1] + C2[,1]) * (C1[,1]*C2[,2] - C2[,1]*C1[,2])) / (6*A),
        sum((C1[,2] + C2[,2]) * (C1[,1]*C2[,2] - C2[,1]*C1[,2])) / (6*A))
    # If the polygon is oriented clockwise, switch the signs
    if(polygon@ringDir == 1) GC <- -GC
    return(GC)
}

In practice, the polygon (area) we want to compute the centroid of may actually comprise multiple polygons. In this case, the area is called a Polygons object, which is a list of Polygon objects, one element for each polygon that it comprises. For example, our map.1 area consists of three polygons. We can see this by viewing the structure of this object, in particular, the polygons slot.

str(map.1@polygons)
## List of 1
##  $ :Formal class 'Polygons' [package "sp"] with 5 slots
##   .. ..@ Polygons :List of 3
##   .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
##   .. .. .. .. ..@ labpt  : num [1:2] 151 -34
##   .. .. .. .. ..@ area   : num 0.00263
##   .. .. .. .. ..@ hole   : logi FALSE
##   .. .. .. .. ..@ ringDir: int 1
##   .. .. .. .. ..@ coords : num [1:7075, 1:2] 151 151 151 151 151 ...
##   .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
##   .. .. .. .. ..@ labpt  : num [1:2] 151.2 -34.1
##   .. .. .. .. ..@ area   : num 0.000115
##   .. .. .. .. ..@ hole   : logi FALSE
##   .. .. .. .. ..@ ringDir: int 1
##   .. .. .. .. ..@ coords : num [1:452, 1:2] 151 151 151 151 151 ...
##   .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
##   .. .. .. .. ..@ labpt  : num [1:2] 151.1 -34.1
##   .. .. .. .. ..@ area   : num 2.86e-05
##   .. .. .. .. ..@ hole   : logi FALSE
##   .. .. .. .. ..@ ringDir: int 1
##   .. .. .. .. ..@ coords : num [1:394, 1:2] 151 151 151 151 151 ...
##   .. ..@ plotOrder: int [1:3] 1 2 3
##   .. ..@ labpt    : num [1:2] 151 -34
##   .. ..@ ID       : chr "0"
##   .. ..@ area     : num 0.00277

There doesn't seem to be one obvious way to deal with the situation of having multiple polygons in a Polygons object. Two approaches are:

  1. Compute the geometric centre for each polygon and then somehow "average" these to get the centroid for the entire Polygons object; or
  2. Only compute the geometric centre for the largest polygon and ignore the others.

To average the centroids while retaining the geometric centre, it would be necessary to weight the centroids according the area they occupy, i.e. taking a weighted average. However, this approach may not be the best solution. If the goal is to add labels to a map, then placing the label on the largest polygon (option 2) is probably preferable over placing it at the average geometric centre, which may appear awkward. If the goal is to compute distances between centroids, then it is debatable which option is better, and this preference may vary between different Polygons objects.

Another difficulty occurs when a Polygons object contains "holes", which are polygons that represent areas of "zero mass". Dealing with holes is more straightforward -- we need to compute the geometric centre of any holes, and find the weighted average of these holes with the non-hole polygon(s), taking into account that holes will move the geometric centre in the opposite direction than had they not been holes.

Therefore, to compute the centroid of a Polygons object using the geometric centre, I define the following UDF which allows the user to specify which method is to be used (either average (default) or largest). If there are multiple Polygon elements, then the geometric centre is computed for each one, including holes, and these are averaged as discussed above (if method = average, otherwise holes are ignored -- I may change this in a future update).

get.centroid.gc <- function(x, method = "average"){
    N <- length(x)  # Number of polygons
    # Initialise data.frame
    Centroids.cg <- data.frame(matrix(NA, N, 2, dimnames = list(NULL, c("long", "lat"))))
    for(i in 1:N){
        Polys <- x@polygons[[i]]
        n <- length(Polys@Polygons)
        if(n == 1){
            Centroids.cg[i,] <- get.gc(Polys@Polygons[[1]])
        }else{
            A <- sapply(Polys@Polygons, function(x) x@area) # Area of each polygon
            D <- sapply(Polys@Polygons, function(x) x@ringDir)  # Ring direction
            if(tolower(method) %in% c("l", "largest")){
                L <- which.max(A[which(D == 1)]) # Largest area of non-hole polygon
                Centroids.cg[i,] <- get.gc(Polys@Polygons[[L]])
            }else if(tolower(method) %in% c("a", "av", "average", "w", "weighted")){
                GC.part <- matrix(NA, n, 2)
                for(j in 1:n){
                    Poly <- Polys@Polygons[[j]]
                    GC.part[j,] <- get.gc(Poly)
                }
                # Weighted average distance between geometric centres
                A <- A * D
                A <- A / sum(A)
                Centroids.cg[i,] <- A %*% GC.part
            }
        }
    }
    return(Centroids.cg)
}

For example:

get.centroid.gc(map.1)
##       long       lat
## 1 151.1798 -34.03417

2.3 Shapefile Default Centroids

If you are using polygons obtained from a shapefile as in the example above, then as part of the spatial information contained within the shapefile, there are some default centroids. These centroids are used as the default point location when plotting labels, and thus the slot is appropriately named labpt. These centroids may be equivalent to one of the two methods above, or may be some other arbitrary point - it is up to the discretion of the creator of the shapefile.

The following code extracts these labpt values:

get.centroid.shp <- function(x){
    N <- length(x)  # Number of polygons
    # Initialise data.frame
    Centroids.shp <- data.frame(matrix(NA, N, 2, dimnames = list(NULL, c("long", "lat"))))
    for(i in 1:N){
        Centroids.shp[i,] <- x@polygons[[i]]@labpt
    }
    return(Centroids.shp)
}

For example:

get.centroid.shp(map.1)
##       long       lat
## 1 151.1816 -34.03142

3 Comparing the Methods

For plotting purposes, it would be nice to see the bounding box as well as the centroids. Thus I first define another UDF which acquires the bounding box for each Polygons object. To avoid confusion with the function from the rgeos packages, I call this UDF get.bbox2.

get.bbox2 <- function(x){
    N <- length(x)  # Number of polygons
    # Initialise data.frame
    bb <- data.frame(matrix(NA, N, 4, dimnames = list(NULL, c("xmin", "ymin", "xmax", "ymax"))))
    for(i in 1:N){
        # Bounding box of polygon
        bb[i,] <- c(bbox(x@polygons[[i]]))
    }
    return(bb)
}

Here is a visual comparison of the methods for map.1.

# Get bounding boxes and centroids
bboxes <- get.bbox2(map.1)
dat <- rbind(
    get.centroid.bb(map.1), 
    get.centroid.gc(map.1, "average"), 
    get.centroid.gc(map.1, "largest"), 
    get.centroid.shp(map.1)
)

# Add plot ID to Centroids.bb and bboxes for facet_wrap
map.df <- fortify(map.1)
N <- length(map.1)
dat$Method <- rep(c(
    "Bounding box", 
    "Geometric centre (average)", 
    "Geometric centre (largest)", 
    "Shapefile default"), 
    each = N
)

# Plot
ggplot(data = map.df) +
    geom_polygon(aes(x = long, y = lat, group = group), fill = "grey60", colour = "black") +
    coord_map() +
    theme_bw() +
    # theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    geom_point(data = dat, aes(x = long, y = lat, colour = Method, shape = Method), size = 2) +
    scale_colour_manual(values = c("red", "blue", "cyan", "black")) +
    geom_rect(data = bboxes, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
        colour = "red", fill = NA)

Here is a visual comparison of the methods for map.2.

# Get centroids
dat <- rbind(
    get.centroid.bb(map.2), 
    get.centroid.gc(map.2, "average"), 
    get.centroid.gc(map.2, "largest"), 
    get.centroid.shp(map.2)
)

# Add plot ID to Centroids.bb and bboxes for facet_wrap
map.df <- fortify(map.2)
dat$id <- unique(map.df$id)
N <- length(map.2)
dat$Method <- rep(c(
    "Bounding box", 
    "Geometric centre (average)", 
    "Geometric centre (largest)", 
    "Shapefile default"), 
    each = N
)

# Plot
ggplot(data = map.df) +
    geom_polygon(aes(x = long, y = lat, group = group, fill = id), colour = "black") +
    coord_map() +
    theme_bw() +
    # theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    # facet_wrap(~ id) +
    geom_point(data = dat, aes(x = long, y = lat, colour = Method, shape = Method), size = 2) +
    scale_colour_manual(values = c("red", "blue", "cyan", "black")) +
    scale_fill_discrete(guide = FALSE)

Back to top.