Geomarketing with R

Disclaimer: The contents of this document come from 13. Geomarketing of Geocomputation with R (Lovelace, Nowosad, & Muenchow, 2019). This document is prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any question, contact the instructor, Yongsung Lee, Ph.D. via yongsung.lee(at)gatech.edu.

This document is also published on RPubs.

library(tidyverse)
library(sf)
library(raster)
library(osmdata) # help download detailed geographic data from OpenStreetMap
library(spDataLarge)
library(tmap)
library(revgeo) # for reverse geocoding 
library(osmdata)
library(classInt)

13.1 Introduction

geomarketing, aka location analysis or location intelligence.
Find out where ‘optimal locations’ are for specific services, based on available data.
Examples of questions are:
- Where do target groups live and which areas do they frequent?
- Where are competing stores or services located?
- How many people can easily reach specific stores?
- Do existing services over- or under-exploit the market potential?
- What is the market share of a company in a specific area?

13.2 Case study: bike shops in Germany

Our scenario

Imagine you are starting a chain of bike shops in Germany.
The stores should be placed in urban areas with as many potential customers as possible.
Single young males (aged 20 to 40) are most likely to buy your products: this is the target audience.

Suggested Workflow

Tidy the input data from the German census (Section 13.3).
Convert the tabulated census data into raster objects (Section 13.4).
Identify metropolitan areas with high population densities (Section 13.5).
Download detailed geographic data (from OpenStreetMap, with osmdata) for these areas (Section 13.6).
Create rasters for scoring the relative desirability of different locations using map algebra (Section 13.7).

13.3 Tidy the input data

First, download, unzip, and read in a German Census csv file.

download.file("https://tinyurl.com/ybtpkwxz", 
              destfile = "census.zip", mode = "wb")
unzip("census.zip") # unzip the files
census_de = readr::read_csv2(list.files(pattern = "Gitter.csv"))
# https://stackoverflow.com/questions/22970091/difference-between-read-csv-and-read-csv2-in-r

An alternative is to use spDataLarge.

data("census_de", package = "spDataLarge")

census_de: a data frame containing 13 variables for more than 300,000 grid cells across Germany.

Data manipulation tasks

dplyr::select and rename (in English) input variables.
- Easting (x)
- Northing (y)
- number of inhabitants (population; pop)
- proportion of women (women)
- mean average age (mean_age)
- average household size (hh_size)
For these selected variables, recode values from -1 to -9 as NA.

# pop = population, hh_size = household size
input = dplyr::select(census_de, x = x_mp_1km, y = y_mp_1km, pop = Einwohner,
                      women = Frauen_A, mean_age = Alter_D,
                      hh_size = HHGroesse_D)
# set -1 and -9 to NA
# https://dplyr.tidyverse.org/reference/mutate_all.html
input_tidy = mutate_all(input, list(~ifelse(. %in% c(-1, -9), NA, .)))
# input_tidy = mutate_all(input, ~ifelse(. %in% c(-1, -9), NA, .)) # this line does the same task

Interestingly, input_tidy provides class codes, instead of actual counts or %, for each variable like the following. Now, we will recode them in the next steps.

13.4 Create census rasters

Step1. rasterFromXYZ()

Requires an input data frame where the first two columns represent coordinates on a regular grid.
All the remaining columns (here: pop, women, mean_age, hh_size) will serve as input for the raster brick layers.

input_ras = rasterFromXYZ(input_tidy, crs = st_crs(3035)$proj4string)

input_ras # check the min/max values for each layer

## class       : RasterBrick 
## dimensions  : 868, 642, 557256, 4  (nrow, ncol, ncell, nlayers)
## resolution  : 1000, 1000  (x, y)
## extent      : 4031000, 4673000, 2684000, 3552000  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## data source : in memory
## names       : pop, women, mean_age, hh_size 
## min values  :   1,     1,        1,       1 
## max values  :   6,     5,        5,       5

epsg:3035

Lambert Equal Area Europe
A projected CRS where each grid cell has the same area, here 1000 x 1000 square meters

Step2. reclassify()

Reclassify the values of the rasters stored in input_ras according to the cutoff values of Table 13.1 below.
We learned reclassify() in 4.3.3 Local operations.
Here, we need a reclassification matrix, each row of which contains the lower bound, upper bound, and new cell value.

rcl_pop = matrix(c(1, 1, 127, 2, 2, 375, 3, 3, 1250, 
                   4, 4, 3000, 5, 5, 6000, 6, 6, 8000), 
                 ncol = 3, byrow = TRUE)
rcl_women = matrix(c(1, 1, 3, 2, 2, 2, 3, 3, 1, 4, 5, 0), 
                   ncol = 3, byrow = TRUE)
rcl_age = matrix(c(1, 1, 3, 2, 2, 0, 3, 5, 0),
                 ncol = 3, byrow = TRUE)
rcl_hh = rcl_women
rcl = list(rcl_pop, rcl_women, rcl_age, rcl_hh)

reclass = input_ras
for (i in seq_len(nlayers(reclass))) {
  reclass[[i]] = reclassify(x = reclass[[i]], rcl = rcl[[i]], right = NA)
}
names(reclass) = names(input_ras)

Check what it means by right = NA above: RDocumentation reclassify.

reclass # again, check the min/max values for each layer

## class       : RasterBrick 
## dimensions  : 868, 642, 557256, 4  (nrow, ncol, ncell, nlayers)
## resolution  : 1000, 1000  (x, y)
## extent      : 4031000, 4673000, 2684000, 3552000  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## data source : in memory
## names       :  pop, women, mean_age, hh_size 
## min values  :  127,     0,        0,       0 
## max values  : 8000,     3,        3,       3

13.5 Define metropolitan areas

Here, we identify metropolitan areas. Note that the input raster layers have the cell size of 1 km by 1 km, which is too small/detailed to delineate metropolitan areas. Thus, we first aggregate the input rater layers by the factor of 20.

pop_agg = aggregate(reclass$pop, fact = 20, fun = sum)

Next, keep only those cells with more than half a million people (now, the cell size is 20 km by 20 km).

pop_agg = pop_agg[pop_agg > 500000, drop = FALSE] # try what happens with drop = TRUE

Plotting this reveals eight metropolitan regions.
Each region consists of one or more raster cells.
Let’s join all cells belonging to one region by raster::clump().

polys = pop_agg %>% 
  clump() %>%
  rasterToPolygons() %>%
  st_as_sf()

Note that polys is an sf object (polygon) with 21 features. For now, neighboring polygons are separate entities. We will spatially dissolve them by clumps (i.e., a temporarily created categorical variable to denote individual metropolitan areas) via tidyverse functions: group_by() and summarize(). When no variable name is specified in summarize(), it geometrically dissolves an input data.

metros = polys %>%
  group_by(clumps) %>%
  summarize()

Still, the metroplitan names are missing in metros. To add the names, we reverse-geocode, which is a task similar to geocoding by in the opposite direction (input: coordinates -> output: addresses).

The revgeo package provides access to the open source Photon geocoder for OpenStreetMap, Google Maps and Bing. By default, it uses the Photon API.
revgeo::revgeo() only accepts geographical coordinates (latitude/longitude); therefore, the first requirement is to bring the metropolitan polygons into an appropriate coordinate reference system.

metros_wgs = st_transform(metros, 4326)
coords = st_centroid(metros_wgs) %>%
  st_coordinates() %>%
  round(4)

Now, let’s reverse geocode with given coordinates.

metro_names = revgeo(longitude = coords[, 1], latitude = coords[, 2], 
                     output = "frame") # this returns a data frame

Alternative way to extract the metro names are:

# attach metro_names from spDataLarge
data("metro_names", package = "spDataLarge")

One minor manual edit: Wülfrath to Duesseldorf (no Umlaut).

metro_names = dplyr::pull(metro_names, city) %>% 
  as.character() %>% 
  ifelse(. == "Wülfrath", "Duesseldorf", .)

13.6 Points of interest

The osmdata package provides easy-to-use access to Open Street Map (OSM) data.
Instead of downloading shops for the whole of Germany, we restrict the query to the defined metropolitan areas, reducing computational load and providing shop locations only in areas of interest.
The subsequent code chunk does this using a number of functions including:
- map() (the tidyverse equivalent of lapply()), which iterates through all eight metropolitan names which subsequently define the bounding box in the OSM query function opq().
- add_osm_feature() to specify OSM elements with a key value of shop (see wiki.openstreetmap.org for a list of common key:value pairs).
- osmdata_sf(), which converts the OSM data into spatial objects (of class sf).
- while(), which tries repeatedly (three times in this case) to download the data if it fails the first time. Before running this code: please consider it will download almost 2GB of data.

shops = map(
  metro_names, 
  function(x) {
    message("Downloading shops of: ", x, "\n")
    
    # give the server a bit time
    Sys.sleep(sample(seq(5, 10, 0.1), 1))
    
    # specify a query with a bounding box and a key   
    query = opq(x) %>%
      add_osm_feature(key = "shop")
    
    # Return an OSM query in sf format 
    points = osmdata_sf(query)
    
    # request the same data again if nothing has been downloaded 
    iter = 2
    while (nrow(points$osm_points) == 0 & iter > 0) {
      points = osmdata_sf(query)
      iter = iter - 1
    }
    # 
    points = st_set_crs(points$osm_points, 4326)
  }
)

Once the above API querries are finished, check if all eight metropolitan areas have at least some shops.
Since these metros are the largest in Germany, it’s unlikely that any of them has no shop.

# checking if we have downloaded shops for each metropolitan area
ind = map(shops, nrow) == 0
if (any(ind)) {
  message("There are/is still (a) metropolitan area/s without any features:\n",
          paste(metro_names[ind], collapse = ", "), "\nPlease fix it!")
}

Once all metros have some shops, further clean the results by selecting two columns.

# select only specific columns
shops = map(shops, dplyr::select, osm_id, shop)
# putting all list elements into a single data frame
shops = do.call(rbind, shops)

An alternative to save time and resources is to use a preprocessed data set from spDataLarge:

data("shops", package = "spDataLarge")

As the last step, convert the spatial point object into a raster.
The sf obejct, shops, is converted to a raster layer with the same parameters (dimensions, resolution, CRS) as the reclass object from 13.4.

# to match crs of shops and reclass
shops = st_transform(shops, proj4string(reclass)) 
# create poi raster
poi = rasterize(x = shops, y = reclass, field = "osm_id", fun = "count")

As with the other raster layers (pop, women, mean_age, and hh_size), the poi raster is reclassified into four classes.
Here, we choose the Fisher-Jenks natural breaks approach, which minimizes within-class variance.

# construct reclassification matrix
int = classInt::classIntervals(values(poi), n = 4, style = "fisher")
int = round(int$brks)
rcl_poi = matrix(c(int[1], rep(int[-c(1, length(int))], each = 2), 
                   int[length(int)] + 1), ncol = 2, byrow = TRUE)
rcl_poi = cbind(rcl_poi, 0:3)  
# reclassify
poi = reclassify(poi, rcl = rcl_poi, right = NA) 
names(poi) = "poi"

13.7 Identifying suitable locations

Now, before identifying the best locations for bike shops, add the poi raster layer to the reclass raster stack object, and remove the pop layer from it.

# add poi raster
reclass = addLayer(reclass, poi)
# delete population raster
reclass = dropLayer(reclass, "pop")

With clean data, the final step is to calculate a final score by summing all raster layers.

# calculate the total score
result = sum(reclass)
# create a boolena raster, cropped by the Berlin metropolitan area 
good_for_bikeshops <- crop(result>9, metros[2, ])

tmap_mode("view")
tm_basemap("OpenStreetMap.Mapnik") +
  tm_shape(metros[2, ]) + # Berlin
  tm_borders(lwd = 0) +   # zero line width for setting up the mapping area 
  tm_shape(good_for_bikeshops) +
  tm_raster(alpha = 0.5)

Geomarketing with `R`

April 15, 2019 (Monday)