author: Michalis Pavlis
During the last few months, I have been working with my colleagues Alex Singleton and Les Dolega on a CDRC project whose objective is to estimate the extent and volume of potential patronage flows between retail centres, that is, the catchment areas of retail centres. One output of our work is the huff-tools library in R, which can be used to develop a Huff retail gravity model, as I will describe in this post. The Huff model gives the probability of a customer at an origin location i patronising a retail centre at destination j. Under the Huff model, this probability is greater the closer a customer is to the retail centre and the more attractive that retail centre is, or a bit more formally:
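$$P_{ij} = \frac{A_j^{\alpha} / d_{ij}^{\beta}}{\sum_{k=1}^{n} A_k^{\alpha} / d_{ik}^{\beta}}$$
where $A_j$ is the attractiveness score of destination $j$, $d_{ij}$ is the distance between origin $i$ and destination $j$, $\alpha$ and $\beta$ are the attractiveness and distance decay exponents respectively, and $n$ is the number of destinations in the study area.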
It is trivial to implement this in R; for example, we can calculate the numerator of the Huff model as follows:
# Numerator of the Huff model for a single origin-destination pair
huff_numerator <- function(destinations_attractiveness, alpha, distance, beta){
  return((destinations_attractiveness ^ alpha) / (distance ^ beta))
}
huff <- mapply(huff_numerator, destinations_attractiveness, alpha, distance, beta)
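Strictly speaking, the mapply call is not necessary here: since R's arithmetic operators are vectorised, the numerator can be computed directly on the input vectors, which is also faster for large datasets.
# Equivalent vectorised computation, assuming destinations_attractiveness and
# distance are numeric vectors and alpha and beta are scalars
huff <- (destinations_attractiveness ^ alpha) / (distance ^ beta)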
We can then use the aggregate function in R to calculate the denominator by summing the huff values conditional on origin location (i.e. the sum of all huff values between a given origin and all destinations within the study area). The huff probabilities can then be obtained like so:
# Assuming we have the origin location names as vector in R (origins_name)
sum_huff_location <- aggregate(huff, by = list(origins_name), sum)
# Rename the fields of the data frame
names(sum_huff_location) <- c("origins_name", "sum_huff")
# Merge numerator and denominator
out <- merge(data.frame(origins_name, destinations_name, distance, huff), sum_huff_location, by="origins_name")
# Calculate the huff probabilities
out$huff_probability <- with(out, huff / sum_huff)
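To make the workflow concrete, here is a toy example with two origins and two destinations; all names and values below are made up purely for illustration:
# Hypothetical toy data: two origins (A, B), two destinations (X, Y),
# one row per origin-destination pair
origins_name <- c("A", "A", "B", "B")
destinations_name <- c("X", "Y", "X", "Y")
destinations_attractiveness <- c(10, 5, 10, 5)
distance <- c(2, 4, 3, 1)
alpha <- 1
beta <- 2
huff <- (destinations_attractiveness ^ alpha) / (distance ^ beta)
sum_huff_location <- aggregate(huff, by = list(origins_name), sum)
names(sum_huff_location) <- c("origins_name", "sum_huff")
out <- merge(data.frame(origins_name, destinations_name, distance, huff), sum_huff_location, by = "origins_name")
out$huff_probability <- with(out, huff / sum_huff)
# The huff probabilities sum to 1 within each origin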
The most interesting part of the analysis was estimating the values of the parameters to plug into the Huff model, and in particular (for my part of the project) obtaining a measure of distance between origin and destination locations. First, I will give a short description of how we developed some of the parameters of the Huff model, and then I will focus on obtaining the distance parameter. Given that we wanted to apply the model at the national scale in England, we selected as origin locations the centroids of LSOAs and as destinations all town centres and regional shopping centres in England. For the attractiveness score of the destinations, Les developed a composite indicator of attractiveness using data on the number of retail stores and businesses in each town centre (e.g. number of anchor stores, number of leisure businesses). As for the alpha and beta parameters, it is advisable to estimate them using actual data on customer flows, although the alpha parameter is often ignored by setting it equal to 1.
With regard to selecting a measure of distance, the time and money costs associated with travel are the most useful proxy, followed by the shortest road distance and the Euclidean (straight-line) distance. Given the absence of data on the cost of travel, we only considered the latter two distance metrics. To calculate the Euclidean distance, the gDistance function in rgeos is a good starting point. For our purposes, it was possible to obtain all pairwise origin-destination distances by setting the "byid" argument to TRUE. More specifically, to calculate the Euclidean distances (in kilometres) between all LSOAs in the origins dataset (polygons) and all retail centres in the destinations dataset (points), you can run the following script in R:
library(rgeos)
distance <- as.numeric(t(gDistance(destinations, origins, byid = T))) / 1000
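The snippet above assumes that the origins and destinations datasets have already been loaded as sp objects. A minimal sketch of how this might be done with the rgdal library (the paths and layer names below are placeholders):
library(rgdal)
# Placeholder paths and layer names; replace with the actual locations of the data
origins <- readOGR("/path/to/data", "lsoa_boundaries")
destinations <- readOGR("/path/to/data", "retail_centres")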
Calculating the road distances is a bit more complicated. First of all, a road network dataset is required, and for England the Meridian 2 road network is a good option. After collecting the road data, it is useful to remove any parts of the road network that are disconnected from the main network (this will save us the frustration of dealing with NULL values later on). The easiest way to do this is by converting the road network to a graph with the igraph library in R and then using either the Depth First Search or the Breadth First Search method to traverse the graph. First things first. We need to import the data into R, and for this we will use the fastshp library (with rgdal it would take a few hours to load the data into R). We will import the road network as a list, which will make it easier to extract the information we need later on.
roads <- read.shp("/path/to/meridian2.shp", "list")
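Each element of the resulting list corresponds to one road line. Assuming fastshp's list format stores the coordinates of each line in the x and y fields (worth verifying on your own copy of the data), a quick inspection looks like this:
# Inspect the structure of the first road line; the coordinates should
# appear as the x and y fields
str(roads[[1]])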
A bit of terminology to make things clear. The Meridian 2 road network is a spatial dataset whose spatial information takes the form of pairs of coordinates representing the points that define the road lines, which combined together create the road network. Each point of a line is termed a node, and the lines of the road network can share endpoint nodes. The important thing here is that we only need to extract the coordinates of these endpoints. Within the list we created with the fastshp library, each road line is an individual entry, and the coordinates of each line are stored in two numeric vectors: the vector denoted x holds the easting values of the nodes defining the line, while y holds the northing values. Remember that we only need the first and last entry of each vector, which we then bind together to obtain the coordinates of the endpoints.
# Function to extract the coordinates of the endpoints
get_endpoints <- function(roads_list){
  n <- length(roads_list$x)
  return(do.call("cbind", list(paste(roads_list$x[1], roads_list$y[1]), paste(roads_list$x[n], roads_list$y[n]))))
}
# Loop over the entries of the list (i.e. the road lines) and use the get_endpoints function to extract the coordinates of the endpoints for each line
endpoints <- do.call("rbind", lapply(roads, get_endpoints))
The script above returns a matrix with two columns of characters. Each column contains the coordinate pair of one endpoint, and each row stores the first and last endpoints that define a single road line. We can then use this matrix to create a graph with the graph.edgelist function from the igraph library. The first argument of the function is our matrix, while the second argument defines whether the graph should be directed or not (we cannot create a directed graph with our data).
library(igraph)
roads_graph <- graph.edgelist(endpoints, directed = FALSE)
The next step is to traverse the graph in order to select the main body of the road network and remove all the other disconnected parts. The easiest way to do this is with the decompose.graph function from the igraph library, but it is also possible to traverse the graph with our own implementation in R. Note that although either Breadth First Search or Depth First Search will do the job, the script below pops the last item off a stack, so it is in fact a Depth First Search (DFS). The method takes as input a graph, which in the following example is represented as an adjacency list. An adjacency list can be created in R as a named list, with the index of each object of the list being a unique endpoint (as a character pair of coordinates) and the values being all the endpoints connected to that endpoint. We take the entries of the first object of the list and place them in a vector (denoted stack in the script below), and as long as the vector has entries, we take its last item and use it as an index into the adjacency list to extract all its neighbouring endpoints, which we then add to the stack. Every endpoint that we remove from the stack and look up in the adjacency list is marked as visited (by placing it in the visited vector), so that we do not process it again. When the stack is empty, we check whether all the indices of the adjacency list have been marked as visited, and if not we repeat the same process for the endpoints we have not visited yet, thus decomposing the graph into its connected components. In R this could be done with the following script.
# Build a graph as an adjacency list
adjacency_list <- list()
unique_endpoints <- unique(as.character(endpoints)) # These will be the indices of the adjacency list
for (i in unique_endpoints){
  # Map each index to its neighbouring endpoints
  adjacency_list[[i]] <- c(endpoints[which(endpoints[,1] == i), 2], endpoints[which(endpoints[,2] == i), 1])
}
graph_parts <- list() # To store the graph components
x <- 1 # To keep track of graph components
while (length(unique_endpoints) > 0){ # For as long as there are unvisited endpoints in the graph
  visited <- NULL
  stack <- unique_endpoints[1]
  while (length(stack) != 0){ # For as long as the stack is not empty
    endpoint <- stack[length(stack)] # Get the last item in the stack
    stack <- stack[-length(stack)] # And then remove it
    if (! endpoint %in% visited){ # Required, as an endpoint may enter the stack more than once before it is visited
      visited <- c(visited, endpoint) # Mark it as visited
      neighbours <- adjacency_list[[endpoint]] # And extract its neighbours from the graph
      stack <- c(stack, neighbours[! neighbours %in% visited]) # Add any neighbours we haven't visited to the stack
    }
  }
  graph_parts[[x]] <- visited # Add the graph component to the list
  unique_endpoints <- unique_endpoints[! unique_endpoints %in% visited] # Remove the component we just traversed
  x <- x + 1
}
Of course, we can obtain essentially the same output using the much faster decompose.graph function from the igraph library.
graph_parts <- decompose.graph(roads_graph)
We can then check whether the graph has more than one component, extract the component with the greatest number of endpoints, and link its endpoints back to the entries of the list we created with fastshp.
if (length(graph_parts) > 1){
  print(paste("There are", length(graph_parts), "components in the road network"))
  # Find the component with the greatest number of endpoints
  graph_endpoints <- V(graph_parts[[which.max(sapply(graph_parts, vcount))]])$name
  # Map the selected graph endpoints back to the road network endpoints
  idxs <- unique(c(which(endpoints[,1] %in% graph_endpoints), which(endpoints[,2] %in% graph_endpoints)))
  # Keep the disconnected lines for plotting later, then extract the main road component
  roads_removed <- roads[-idxs]
  roads <- roads[idxs]
}
What is left to do is to convert the selected road network back to a shapefile. This can be accomplished with the following function.
# Retrieve the projection information of the original shapefile with rgdal
proj4roads <- ogrInfo(path_to_shp, shp_name)$p4s
convert_to_SpatialLinesDF <- function(roads_list, proj4roads){
  lines_sp <- vector("list", length(roads_list))
  for (i in 1:length(lines_sp)){
    lines_sp[[i]] <- Lines(list(Line(cbind(roads_list[[i]]$x, roads_list[[i]]$y))), as.character(i))
  }
  return(SpatialLinesDataFrame(SpatialLines(lines_sp, proj4string = CRS(proj4roads)), data.frame(ID = 1:length(lines_sp))))
}
We can now plot the part of the road network that was selected and the part that was removed with the following script.
connected <- convert_to_SpatialLinesDF(roads, proj4roads)
plot(connected)
lines(convert_to_SpatialLinesDF(roads_removed, proj4roads), col = "red", lwd = 1.5)
title(main = paste("There are", length(graph_parts), "unconnected parts in the road network"),
      sub = "Lines marked with red were removed")
To calculate the shortest road distances we need to build a graph from the cleaned road network and then use the shortest.paths function from the igraph library, which implements Dijkstra's algorithm. Unlike the previous graph, however, we will extract all the nodes that define each road line. The graph can then be built as above, only this time we will weight each edge of the graph by the length of the corresponding road segment.
# Function to extract the coordinates of consecutive node pairs along a road line
get_nodes <- function(roads_list){
  n <- length(roads_list$x)
  return(do.call("cbind", list(roads_list$x[-n], roads_list$y[-n], roads_list$x[-1], roads_list$y[-1])))
}
# Create a matrix with column 1 providing x_1, column 2 y_1, column 3 x_2 and column 4 y_2
nodes <- do.call("rbind", lapply(roads, get_nodes))
# To build the graph, we need a matrix with two columns of characters, the first is the start (x,y) the second is the end (x,y)
igraph_object <- graph.edgelist(cbind(paste(nodes[,1], nodes[,2]), paste(nodes[,3], nodes[,4])), directed = FALSE)
# Weight each graph edge by its segment length in kilometres (the coordinates are in metres)
E(igraph_object)$weight <- sqrt((nodes[, 1] - nodes[, 3])^2 + (nodes[, 2] - nodes[, 4])^2) / 1000
Before we can calculate the distances between origin points and destination points we have to map these points onto the closest nodes of the graph we just created. For this we will use the get.knnx function from the FNN library, which returns the k nearest neighbours of each query point based on Euclidean distance.
# Extract graph nodes
xy <- do.call(rbind, strsplit(V(igraph_object)$name, " "))
# Convert entries of matrix from character to numeric
storage.mode(xy) <- "numeric"
# Map origin and destination points to closest graph nodes
origins_id <- get.knnx(xy, coordinates(origins), 1)$nn.index[,1]
destinations_id <- get.knnx(xy, coordinates(destinations), 1)$nn.index[,1]
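It is worth checking how far the points end up from the graph nodes they are snapped to; get.knnx also returns the nearest-neighbour distances, so a quick sanity check might look like this:
# Snapping distance (in metres, as the coordinates are in metres) between
# each origin point and its matched graph node
snap_distance <- get.knnx(xy, coordinates(origins), 1)$nn.dist[, 1]
summary(snap_distance)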
We are now ready to calculate the road distances with the shortest.paths function. Three inputs are required: the graph, the destination nodes (destinations_id, passed one at a time as the v argument) and the origin nodes (origins_id, passed as the to argument). We only need to make sure that there are no duplicate ids in the to argument, as shown below.
destinations_nr <- nrow(destinations@data)
origins_nr <- nrow(origins@data)
destinations_name <- destinations@data$Name
origins_name <- origins@data$Name
# If we don't have duplicate origin ids we can run the shortest.paths function directly
if (anyDuplicated(origins_id) == 0){
  # Build the output data frame
  out <- data.frame(origins_name = rep(origins_name, destinations_nr),
                    destinations_name = as.character(sapply(destinations_name, rep, origins_nr)),
                    distance = rep(NA, origins_nr * destinations_nr), stringsAsFactors = F)
  for (i in 0:(destinations_nr - 1)){
    # Use the shortest.paths function, one destination at a time
    out[(i * origins_nr + 1):((i + 1) * origins_nr), "distance"] <-
      as.numeric(shortest.paths(igraph_object, v = destinations_id[i + 1], to = origins_id))
  }
} else {
  # If we have duplicate origin ids, calculate the distances to the unique origins only
  unique_origins_id <- unique(origins_id)
  out <- data.frame(origins_name = rep("None", destinations_nr * origins_nr),
                    destinations_name = as.character(sapply(destinations_name, rep, origins_nr)),
                    distance = rep(NA, origins_nr * destinations_nr), stringsAsFactors = F)
  for (i in 0:(destinations_nr - 1)){
    # Use the shortest.paths function, then merge the output for the unique origins
    # with the complete set of origins
    out[(i * origins_nr + 1):((i + 1) * origins_nr), c(1, 3)] <-
      merge(data.frame(origins_id = origins_id, origins_name = origins_name, stringsAsFactors = F),
            data.frame(origins_id = unique_origins_id,
                       distance = as.numeric(shortest.paths(igraph_object, v = destinations_id[i + 1], to = unique_origins_id))),
            by = "origins_id", sort = F)[, 2:3]
  }
}
The reason we chose to use origins_id for the to argument of the function (and looped over the destinations instead of the origins) is that the destinations are usually fewer than the origins, hence we have to do fewer loops and the script will execute faster. In general, there is no difference between selecting either the origins or the destinations for the last two arguments of the shortest.paths function. One improvement we could make to this script is to calculate the distances to the boundary of the retail centres instead of their centroid. To do this, we can create a set of points along each retail centre boundary and then use the dplyr library to group the outcome by destination and origin in order to take the minimum distance.
library(dplyr)
polygons_to_points <- function(polys, name){
  # Extract the boundary coordinates of one Polygons object, tagged with its index
  get_coords <- function(poly, idx){
    return(do.call("cbind", list(poly@Polygons[[1]]@coords, idx)))
  }
  if (nrow(polys) > 1){
    out <- do.call("rbind", mapply(get_coords, polys@polygons, 1:nrow(polys), SIMPLIFY = FALSE))
  } else {
    out <- cbind(polys@polygons[[1]]@Polygons[[1]]@coords, 1)
  }
  polys@data[[name]] <- as.character(polys@data[[name]])
  out <- as.data.frame(out)
  names(out) <- c("easting", "northing", "idx")
  out <- merge(out, data.frame(idx = 1:nrow(polys), Names = polys@data[[name]], stringsAsFactors = F), by = "idx")
  out <- out[, -1]
  coordinates(out) <- ~ easting + northing
  proj4string(out) <- proj4string(polys)
  return(out)
}
# Extract boundary points
destinations_pnt <- polygons_to_points(destinations, "Name") # Then use shortest.paths function as shown above
# From the calculated distances between the boundary points of the retail centres and the centroids of the origins take the minimum distance
out <- out %>%
  group_by(destinations_name, origins_name) %>%
  summarize(distance = min(distance)) %>%
  as.data.frame()
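Finally, to close the loop, the road distances calculated above can be fed straight back into the Huff computation from the start of this post. A minimal sketch, assuming a destinations_attractiveness vector matched row by row to out$destinations_name and the usual choice of alpha = 1 (the beta value below is purely illustrative):
alpha <- 1
beta <- 2 # Illustrative value; ideally estimated from observed customer flows
# destinations_attractiveness is assumed to be aligned with out$destinations_name
out$huff <- (destinations_attractiveness ^ alpha) / (out$distance ^ beta)
sum_huff_location <- aggregate(out$huff, by = list(out$origins_name), sum)
names(sum_huff_location) <- c("origins_name", "sum_huff")
out <- merge(out, sum_huff_location, by = "origins_name")
out$huff_probability <- with(out, huff / sum_huff)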