Introduction

This document presents the set of codes accompanying the book “Modelling Spatial Density. Data, Methods and R Applications in Statistics, Econometrics, and Machine Learning” published by Oxford University Press (Kopczewska, 2025). The codes are organised by book chapter and allow for full reproducibility of the analyses, figures, tables, and statistics presented. The technical explanations are included in the code comments, while the full methodological details can be found in the book.

Reading packages and data

All data used in these examples can be downloaded from GitHub. The codes mainly use the sf class for spatial data handling and occasionally the sp class where necessary.

First, one needs to prepare a working directory and download packages:

# STARTER – Reading the data and necessary packages
# Description of the data sets is in Appendix 1

# Download all necessary packages from CRAN - once for each package
install.packages("sf")   # sf is essential for spatial data
install.packages("ggplot2") # ggplot2 is crucial for nice graphics

# Read the universal packages (when downloaded)
library(sf)     # to operate in sf class
library(ggplot2)        # for plotting

# Set working directory – please set your own path
setwd("C:/your_workig_directory_folder")

The next step is to read in and/or generate contours that provide a spatial reference for the data being analysed:

# CONTOURS #
# Read NUTS2 contour map (shapefile) into sf class
NTS2<-st_read("wojewodztwa.shp")        # all 16 regions of Poland
NTS2<-st_transform(NTS2, 4326)      # reproject to WGS84
# separate one NTS2 region – mazowieckie (Mazovian region)
region<-NTS2[NTS2$jpt_nazwa=="mazowieckie", ] # only Mazovian region

# Read NUTS4 contour map (shapefile) into sf class
NTS4<-st_read("pow_maz.shp")    # few NUTS4 within 1 NUTS2 region
NTS4<-st_transform(NTS4, 4326)  # reproject to WGS84
# cut single NUTS4/LAU1 poviat – city of Warsaw (Warszawa)
Warszawa<-NTS4[NTS4$jpt_nazwa_=="powiat Warszawa",]

# own-created boundary – box with side of length=1, from sf::
# polygon in sfc class as an analytical area for generated points
# values in brackets in cbind() are coordinates (x,y)
# start at the bottom left corner (0,0) and then go right to the
# bottom right corner (1,0), then up to the top right corner (1,1), 
# then left to the top left corner (0,1) and then down to (0,0)

polygon<-st_sfc(st_polygon(list(cbind(c(0,1,1,0,0),c(0,0,1,1,0)))))
poly.sf<-st_as_sf(polygon, crs=4326) # sf class, CRS set to WGS84

plot(st_geometry(polygon)) # plot the box
axis(1); axis(2)            # plot the x and y axes

Then all the data needs to be imported into R:

# DATASETS #
# Read point data on firms into data.frame class and sf class
firms<-read.table("firms_data.txt") # read into data.frame class
head(firms) # check what the data.frame object looks like
dim(firms)      # check dimensions – number of rows and columns
summary(firms)  # check statistics and if data are numeric

# Convert to sf class
firms.sf<-st_as_sf(firms, coords=c("lon", "lat"), crs=4326) 

# Object that stores the IDs of columns with location
crds.col<-9:10  # columns with lon (x)  lat (y) in firms object

# sampling to get subset – by randomly selecting rows
nn<-2000                        # size of subsample
selector<-sample(1:dim(firms)[1], nn, replace=FALSE) # selected rows
sub.firms<-firms[selector, ]        # subsample of firms in df
sub.firms.sf<-firms.sf[selector, ]  # subsample of firms in sf

# Read population point data into data.frame and sf classes
popul<-read.csv("popul_data.csv", header=TRUE, dec=".", sep=",")
popul.sf<-st_as_sf(popul, coords=c("x", "y"), crs=4326)

# Sampling to get a subset – by randomly selecting rows
nn<-5000
selector.popul<-sample(1:dim(popul)[1], nn, replace=FALSE)
sub.popul<-popul[selector.popul, ]  # subsample of population df
sub.popul.sf<-popul.sf[selector.popul,] # subsample of population sf

# NTS4 data
NTS4_data<-read.csv("NTS4_data.csv", header=TRUE, dec=".", sep=",")
summary(NTS4_data)

Once the data are imported, it is important to check that they have been read correctly.
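A quick consistency check can also be run at this point – a minimal sketch, assuming the objects created above:

# Optional checks: all layers should share the same CRS (WGS84)
# and the attribute table should have one row per NTS4 polygon
st_crs(firms.sf)==st_crs(NTS2)  # should be TRUE
dim(NTS4)[1]==dim(NTS4_data)[1] # should be TRUE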

# Output of objects
# point data on firms in class sf
firms.sf        # inspection of what the dataset looks like

# population point data in class data.frame
head(popul) # show first six rows of data in data.frame class

# population point data in class sf
popul.sf        # see the first ten rows of data in class sf

# regional poviat (NTS4) data in class data.frame
head(NTS4_data)

# shapefile map for poviats (NTS4)
# Note that poviat_name from NTS4_data object is in the same order 
# as jpt_nazwa_ from NTS4 object – this ensures correct linking 
head(NTS4)

Codes for Chapter 1

1.3 Spatial density in the social sciences

This is the code for plotting points and their density in various ways. Typically, plotting geolocated points and their density is the first step in density analysis. The plot() command is the simplest, but very efficient, especially for large data sets. More advanced visualisations, but for smaller samples, can be obtained using ggplot() and additional functions from the ggdensity:: package. Useful functions are geom_hdr(), which performs 2D density estimation and finds the highest density regions (HDR), its adjust option, which sets the bandwidth for KDE, and stat_hdr(), which finds the highest density regions with given probabilities. Colours can be set using, for example, the viridis:: package with the scale_fill_viridis() function – these palettes convert well to greyscale.

# Fig.3 - plotting points and their density

# Fig.3a – Simple plot of all points
plot(firms[, crds.col], pch=".", axes=FALSE)
axis(1); axis(2)        # a semicolon separates commands on one line

# alternative solution in ggplot()
ggplot(sub.firms, aes(x=lon, y=lat)) +  geom_point(size=0.5)+ theme(legend.position="none") + labs(title="Location of firms", x="Longitude", y="Latitude") 

# Fig.3b - Hexagonal density plot for the subsample
d<-ggplot(sub.firms, aes(lon, lat)) # save as object, do not display
library(viridis)    # for scale_fill_viridis(); geom_hex() also needs the hexbin:: package installed
d + geom_hex() + theme(legend.position="none") + scale_fill_viridis(option="H") # plot as hexagons

# Fig.3c – Kernel density estimation (KDE) plot
install.packages("ggdensity")           # run it once only
library(ggdensity)
d + geom_hdr(method=method_kde(adjust=1/2)) + theme(legend.position="none")

# Fig.3d – Plot of density polygons 
ggplot(data=sub.firms, aes(lon, lat)) +  geom_point(alpha=0.1) + stat_hdr(probs=c(0.8, 0.5), aes(fill=after_stat(probs)), color="black", alpha=0.8) + scale_fill_brewer(type="seq") + theme(legend.position="none")

Figure 3: Visualisations of business density: a) point data, b) hexagonal representation, c) kernel density, d) points with density polygons

Codes for Chapter 2

2.1 K-means, self-organising maps and hierarchical clustering of geo-coordinates

This is the code for clustering geo-coordinates using three algorithms: k-means, self-organising maps (SOM) and a hierarchical tree. K-means clustering is available in the built-in stats:: package as the kmeans() command – this package does not need to be loaded with library() first. Like most clustering functions, it outputs two main pieces of information: the clustering vector (in the cluster slot) – a vector of natural numbers that assigns each observation to one of the clusters, and centroids – the values of the clustering variables that are most central in a given cluster. The number of clusters is set by the user. Here, geo-coordinates stored in the matrix class are clustered. In addition, the stats:: package includes dist(), hclust() and cutree(), which are needed for hierarchical clustering – dist() creates the Euclidean distance matrix, hclust() builds the whole tree and cutree() cuts the dendrogram to assign observations to the requested number of clusters. Other clustering algorithms such as SOM require additional packages – here the kohonen:: package is used.
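A minimal sketch of the kmeans() output structure, on hypothetical random data rather than the book’s dataset:

# kmeans() returns the clustering vector and the centroids
xy<-cbind(x=runif(100), y=runif(100))   # 100 random 2D points
km<-kmeans(xy, centers=4)       # partition into 4 clusters
head(km$cluster)    # cluster ID (1–4) assigned to each observation
km$centers      # coordinates of the 4 cluster centroids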

R has many packages that define colours for plots – one of them is the tempR:: package, which allows for creating a nice set with the pretty_palette() command. Other solutions will be presented with the next examples.

Plotting clusters in colours is possible with the simple plot() function. Additional layers – the centroids of clusters – have been added in yellow. For k-means and SOM, the centroids are stored in the clustering objects. For the hierarchical tree they were computed with sf_multipoint() from the sfheaders:: package in a for() loop over clusters – as sf_multipoint() outputs an sf object, the st_coordinates() and st_centroid() commands were needed to retrieve the geo-coordinates and store them in the data.frame class.

# Fig.5 - Clustering of geo-coordinates and plotting
# Colours of clusters are according to cluster ID

# define colour palette
install.packages("tempR")           # run it once only
library(tempR)
pretty_palette<-pretty_palette(4)   # to get four colours
pretty_palette                  # check vector of colours 
#[1] "red"         "lightblue"   "forestgreen" "purple"     

library(viridis)
viri_cols<-viridis(n=4, option="F")
viri_cols
#[1] "#03051AFF" "#841E5AFF" "#F06043FF" "#FAEBDDFF"

# Clustering for Figures 5 and 6

# k-means clustering with stats:: (4 clusters, subsample data set)
kmeans_model<-kmeans(as.matrix(sub.firms[,crds.col]), centers=4)

# SOM (Self-Organising Maps) clustering – for Fig.5a
install.packages("kohonen") # run it once only
library(kohonen)
# division into 400 clusters – this takes long on a large dataset
som_grid400<-somgrid(xdim=20, ydim=20, topo="hexagonal") # 400 clust
som_model400<-som(as.matrix(sub.firms[,crds.col]), grid=som_grid400, rlen=500, alpha=c(0.05,0.01), keep.data=TRUE)  # SOM clustering

# Split into 4 clusters – for Fig.6b
som_grid4<-somgrid(xdim=2, ydim=2, topo="hexagonal") # 4 clusters
som_model4<-som(as.matrix(sub.firms[,crds.col]), grid=som_grid4, rlen=500, alpha=c(0.05,0.01), keep.data=TRUE)  # SOM clustering

# Hierarchical clustering
dm<-dist(sub.firms[,crds.col]) # matrix of distances between obs.
hc<-hclust(dm, method="complete")   # simple dendrogram
clust<-cutree(hc, k=4)          # split into 4 clusters
# Figure 5
# Fig.5a – Typical visualization of SOM
coolBlueHotRed<-function(n, alpha=1) {rainbow(n, end=4/6, alpha=alpha)[n:1]}    # nice colours
par(mar=c(1,1,1,1)) # small margins around the figure
plot(som_model400, type="property", main="SOM clustering of x coordinate", property=getCodes(som_model400)[,1], palette.name=coolBlueHotRed)

# Fig.5b – Typical dendrogram visualisation
plot(hc, labels=FALSE, hang=-1, sub="", main="Dendrogram of xy coordinates – selection of 4 clusters") # dendrogram
rect.hclust(hc , k=4, border=2:5) # coloured boxes around clusters

Figure 5: Typical visualisations of clustering methods: a) Self-Organising Maps (SOM) for 400 clusters, b) hierarchical tree (dendrogram) with 4 clusters highlighted

# Figure 6
# Fig.6a – Mapping of k-means clusters  
par(mar=c(2,2,2,2)) # slightly wider margins
plot(sub.firms[,crds.col], bg=viri_cols[kmeans_model$cluster], pch=21, cex=0.99, main="K-means clustering")
points(kmeans_model$centers, pch=21, bg="yellow", cex=3) #centroids

# Fig.6b – Mapping of SOM (Self-Organising Maps) clusters
plot(sub.firms[, crds.col], pch=21, cex=0.99, bg=pretty_palette[som_model4$unit.classif],  main="SOM clustering")
points(as.data.frame(som_model4$codes), pch=21, bg="yellow", cex=3) #centroids

# Fig.6c – Mapping of hierarchical clusters (from dendrogram)
clust<-cutree(hc, k=4) # split into 4 clusters
plot(sub.firms[,crds.col], bg=pretty_palette[clust], pch=21, cex=0.99, main="Hierarchical clustering")

# Find and plot centroids among clustered points (Fig.6c)
# sf_multipoint() function finds a centroid point among points
# for() loop finds centroids in each cluster
library(sfheaders) # for sf_multipoint() function
tree.crds<-data.frame(X=0, Y=0) # object to store centroids
for(i in 1:4){  # loop for iteration over each cluster i
iter.rows<-which(clust==i)  # select points from each cluster i
sf.mp<-sf_multipoint(sub.firms[iter.rows, crds.col]) #sfheaders::
tree.crds[i, ]<-st_coordinates(st_centroid(sf.mp))} # save xy
points(tree.crds, pch=21, bg="yellow", cex=3) # plot centroids

# Fig.6d – Silhouette for k-means
library(cluster)
library(factoextra)
sil<-silhouette(kmeans_model$cluster, dist(sub.firms[, crds.col]))
fviz_silhouette(sil) + scale_colour_viridis_d() +  theme(legend.position="none") # plot silhouette
# Rand Index (RI)
library(clv)
# std.ext() takes both clustering vectors to compare
clust_1_2<-std.ext(kmeans_model$cluster, som_model4$unit.classif)
clust_1_3<-std.ext(kmeans_model$cluster, clust)
clust_2_3<-std.ext(som_model4$unit.classif, clust)

rand_1_2<-clv.Rand(clust_1_2)   # RI for k-means & SOM
rand_1_3<-clv.Rand(clust_1_3)   # RI for k-means & hierarchical
rand_2_3<-clv.Rand(clust_2_3)   # RI for SOM & hierarchical 

Figure 6: Clustering of geo-coordinates yielding catchment areas: a) k-means; b) SOM; c) hierarchical; d) silhouette for k-means

2.2 Quick Density Clustering (QDC)

The first part of the code (Fig.7a) prepares an illustration of the intuition behind QDC. It starts by creating a box (polygon) with the st_sfc() command, which will be a place to display other elements. Inside this box, random points are sampled (with st_sample()) and used for further processing: getting their coordinates with st_coordinates() from sf::, listing their few nearest neighbours with kNN() from dbscan::, plotting points with points(), plotting lines between points with lines() and circles around the points with draw.circle() from plotrix::. The text is added using text(). Note that when plotting only contours of sf objects (with no data attached), use the st_geometry() command to extract the geometry from the whole sf object.

# Fig.7a – Overview of how QDC works
# Simulate a random point pattern to overview QDC construction

install.packages("dbscan")  # run it once only
install.packages("plotrix")     # run it once only

library(dbscan)     # for kNN()
library(plotrix)        # for draw.circle()

# Randomly drawn n=25 points (2D, xy) inside the box 
set.seed(1235)  # for reproducible result
points.random<-st_sample(polygon, 25, type="random") # class sf

# Plot box and points (with ID labels)
plot(st_geometry(polygon)) # plot the box
axis(1); axis(2)            # plot the axes
plot(st_geometry(points.random), add=TRUE, pch=21, cex=0.8)# points

# Extract xy crds to data.frame and use them to locate the text
pts.random<-as.data.frame(st_coordinates(points.random)) # xy crds
text(pts.random[,1], pts.random[,2]+0.03, 1:25, cex=0.7) # text

# k nearest neighbours (k=5) – list of IDs and distances to knn
knn.pts.random<-kNN(as.matrix(pts.random), 5) # from dbscan::
knn.pts.random      # Output structure
# Fig.7a - plot the final figure 
skip<-c(21,1,20,17) # use to skip some points (optional)
plot(pts.random[-skip,1:2], main="Spatial variables: distance to KNN and fixed-radius NN", ylim=c(0,1), xlim=c(0,1), pch=21, bg="black")

# select the first point to make an analysis of neighbourhood
core<-19    # id of central point no.1 – select when needed
#core<-14   # id of central point no.2 – select when needed
vd<-knn.pts.random$dist[core, ] # vector of distances to knn=5
vec.id<-knn.pts.random$id[core, ] # vector of IDs of knn

# plot lines, from (x1,x2) to (y1,y2)
for(i in 1:3){  # for knn=3 (out of knn=5)
lines(pts.random[c(core, knn.pts.random$id[core, i]),1], pts.random[c(core, knn.pts.random$id[core, i]),2])}

# draw a circle around the first core point and put the text
where<-1    # position of text in the upper part of figure
#where<-0.91    # location for the 2nd row of lines and text
draw.circle(pts.random[core,1], pts.random[core,2], radius=0.2)
# adjust the message: high/low density and number of FrNN
text(0, where-0.025,"total distance to knn=3 for this density", pos=4)
text(pts.random[core,1], pts.random[core,2], "FrNN=5", pos=4)

# upper red lines in ‘aggregated’ form (x1,x2) to (y1,y2)
clr<-"red"  # define colour

lines(c(0,vd[1]), c(where, where), col=clr) 
points(c(0,vd[1]), c(where, where), pch=21, bg=clr, cex=0.8)

lines(c(vd[1], vd[1]+vd[2]), c(where, where), col=clr)
points(c(vd[1], vd[1]+vd[2]), c(where, where), pch=21, bg=clr, cex=0.8)

lines(c(vd[1]+vd[2], vd[1]+vd[2]+vd[3]), c(where, where), col=clr)
points(c(vd[1]+vd[2], vd[1]+vd[2]+vd[3]), c(where, where), pch=21, bg=clr, cex=0.8)

The second part of the code (Fig.7b) represents the step-by-step analysis of QDC. It is based on two functions from the dbscan:: package: kNNdist(), which returns the distances from each point to its k nearest neighbours, and frNN(), which lists the points that lie within a fixed radius of a given point and the distances to the core. These data become the newly created spatial variables. The command density() from stats:: estimates the statistical density function and data.Normalization() from clusterSim:: normalises the given variables. The normalisation (standardisation) could also be done with scale() from stats::, but this function changes the names of the variables and causes some conflicts and problems with the visibility of the data.
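A minimal sketch of the output structures of both functions, on hypothetical random points:

# kNNdist() returns a matrix of distances (one column per neighbour);
# frNN() returns lists of neighbour IDs and distances for each point
library(dbscan)     # for kNNdist() and frNN()
m<-matrix(runif(40), ncol=2)    # 20 random 2D points
head(kNNdist(m, 3, all=TRUE))   # distances to 1st, 2nd and 3rd nn
fr<-frNN(m, eps=0.2)        # neighbours within radius eps
fr$id[1:2]      # IDs of neighbours of the first two points
lengths(fr$id)[1:2] # counts, as with lapply(…, length) below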

# Fig.7b – Statistical distributions of spatial variables in QDC

# Randomized and jittered dataset of firms’ locations
# Rows to be included in a subsample of 5000 obs.
selected<-sample(1:dim(firms)[1], 5000, replace=FALSE)

# Epsilon to jitter points – to avoid overlapping locations
# Normal distribution to make spatial shifts multi-directional
xee<-rnorm(5000, 0, 0.0001)
yee<-rnorm(5000, 0, 0.0001)

# new dataset in data.frame class – with selected rows only
dens<-data.frame(x=firms$lon[selected], y=firms$lat[selected])
dens$x<-dens$x+xee  # new longitude, corrected by epsilon
dens$y<-dens$y+yee  # new latitude, corrected by epsilon

library(dbscan)     # for kNNdist() and frNN()

# first spatial variable – total distance to knn
# calculate distances to knn=10 based on xy coordinates
# kNNdist() returns knn columns with consecutive distances to knn-th
# apply() calculates sum of rows (1) from given object 
knn.dist<-kNNdist(dens[,1:2], 10, all=TRUE)
dens$knn10dist<-apply(knn.dist,1,sum)   # sum of distances to 10 knn

# second spatial variable – number of neighbours in fixed radius
# number of neighbours within a fixed radius in eps=0.05 (ca.5 km)
# frNN() returns a list of neighbour IDs for each observation
# lapply() measures the length of each list element
# unlist() flattens the list into a vector
agg.radius<-frNN(as.matrix(dens[,1:2]), eps=0.05)
dens$agg<-unlist(lapply(agg.radius$id, length))

# normalise (standardise) the spatial variables
install.packages("clusterSim")  # run it only once
library(clusterSim)         # for data.Normalization()
dens$knn10dist.s<-data.Normalization(dens[,3], type="n1", normalization="column")
dens$agg.s<-data.Normalization(dens[,4], type="n1", normalization="column")
head(dens)                  # inspection of the object
# both spatial variables with other parameters (eps=0.10, knn=30)
# variables are not added to a previous data object to keep it clean
knn.dist<-kNNdist(dens[,1:2], 30, all =TRUE)
sum.row<-apply(knn.dist,1,sum)

agg.radius<-frNN(as.matrix(dens[,1:2]), eps=0.10)
agg<-unlist(lapply(agg.radius$id, length))
# Fig.7b - Plot the statistical distribution of spatial variables
plot(density(dens[,5]), ylim=c(0,2), lwd=2, main="Density distribution of spatial variables")
lines(density(dens[,6]))
lines(density(scale(sum.row)), lwd=2, lty=3) # normalized by scale()
lines(density(scale(agg)), lty=3)
legend("topright", legend=c("sum of dist. to knn=10 (scaled)", "sum of dist. to knn=30 (scaled)", "numb. of neighb. in r=5 km (scaled)", "numb. of neighb. in r=10 km (scaled)"), lty=c(1,3,1,3), lwd=c(2,2,1,1), bty="n")

Figure 7: Spatial variables for a given point pattern: a) construction of spatial variables for selected points; b) statistical distributions of spatial variables with different parameters eps and knn

The third part of the code (Fig.8) maps the two spatial variables – an interesting point is how to plot them with sp graphics, which requires a transformation of the data to the sp class. Two functions with unusual syntax are used: coordinates(), which not only extracts the (x,y) coordinates but also changes the class of the object from data.frame to sp, and proj4string(), which sets the correct projection for the sp object.

# Fig.8 – Maps of both spatial variables (using sp graphics) (QDC)

# Convert data.frame object to sp class – based on Fig.7a code
install.packages("sp")  # run only once
library(sp)         # convert data.frame to sp class
density.sp<-dens            # copy the data.frame object
coordinates(density.sp)<-c("x","y") # change to sp class
proj4string(density.sp)<-CRS("+proj=longlat +datum=WGS84") # WGS84, as in the other layers

spplot(density.sp["knn10dist.s"], main="Standardised sum of distances \n to 10 nearest neighbours", key.space=list(x=0.7,y=.25, corner=c(0.05,0.75))) # \n makes a new line (the same as hard enter)

spplot(density.sp["agg.s"], main="Standardised number of neighbours \n in radius of 5 km", key.space=list(x=0.7,y=.25, corner=c(0.05,0.75)))

Figure 8: Spatial distributions of standardized density variables: a) sum of distances to k=10 nearest neighbours; b) number of neighbours within radius of r=5 km

The fourth and further parts of the code (Fig.9,10,11,12) present QDC in a step-by-step version and as an automated solution with the spatialWarsaw:: package, which is available on GitHub. It includes not only k-means clustering but also SOM and hierarchical partitioning. Plotting is done using ggplot() commands. It also includes the code for feature importance in k-means clustering, which was mentioned in the text. It uses FeatureImpCluster() from the FeatureImpCluster:: package for the feature importance itself and as.kcca() from the flexclust:: package to convert the output into the required kcca class.

Plots generated using the ggplot() command use a number of colouring options. In general, functions with ‘fill’ in the name are for colouring areas (such as regions), while those with ‘colour’ are for colouring points. Most palettes are for continuous colours, while those that include ‘d’ or ‘binned’ are for discrete colours. Some palettes, such as viridis, are suitable for colour-blind people and convert well to greyscale. Therefore colour options like scale_fill_viridis() and scale_fill_viridis_c() are for colouring regions with continuous colours; options like scale_fill_viridis_d() and scale_fill_brewer() are for colouring regions with discrete colours; options like scale_colour_gradient() or scale_color_viridis_c() are for colouring points with continuous colours; and options like scale_colour_viridis_d() are for colouring points with discrete colours.
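A minimal illustration of this naming convention, on a hypothetical data frame:

# discrete vs continuous colouring of points with viridis scales
df<-data.frame(x=runif(50), y=runif(50), g=factor(sample(1:3, 50, replace=TRUE)))
ggplot(df, aes(x, y, colour=g)) + geom_point() + scale_colour_viridis_d() # discrete classes
ggplot(df, aes(x, y, colour=x)) + geom_point() + scale_colour_viridis_c() # continuous values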

# Fig.9 – Density clustering with QDC (using ggplot graphics)

# k-means clustering of spatial variables (based on Fig.7a code)
km.set1<-kmeans(dens[ ,5:6], 3) # 3 clusters based on 2 spat.var.
dens$clusters_kmeans<-as.factor(km.set1$cluster)

colnames(dens)<-c("x", "y", "knndist", "frnn", "Total_distance_to_k_nearest_neighbours", "Number_of_points_around", "Cluster_ID") # note: the k-means cluster column is renamed to Cluster_ID

# Scatterplot of both spatial variables, k-means clusters in blue / red / green colours
ggplot(dens, aes(x=Total_distance_to_k_nearest_neighbours, y=Number_of_points_around, color=Cluster_ID)) + geom_point() + annotate("text", x=2, y=2, label="high density
(x) (low x) – close to k nearest neighbours, 
(y) (high y) - many neighbours around", col="blue") + annotate("text", x=3, y=-0.4, label="low density
(x) (high x) – far to k nearest neighbours, 
(y) (low y) – only a few neighbours around", col="red") + annotate("text", x=1.5, y=0.5, label="medium density", col="darkgreen")

# Location map of points, k-means clusters in colours
ggplot(dens, aes(x=x, y=y, color=Cluster_ID)) +  geom_point() + theme(legend.position="none")

# The same in viridis colours
# Scatterplot of both spatial variables, k-means clusters in colours
ggplot(dens, aes(x=Total_distance_to_k_nearest_neighbours, y=Number_of_points_around, color=Cluster_ID)) + geom_point() + annotate("text", x=3, y=2, label="high density
(x) (low x) – close to k nearest neighbours, 
(y) (high y) - many neighbours around") + annotate("text", x=3.5, y=-0.4, label="low density
(x) (high x) – far to k nearest neighbours, 
(y) (low y) – only a few neighbours around") + annotate("text", x=1.5, y=0.5, label="medium density") + scale_colour_viridis_d(option="H")

# Location map of points, k-means clusters in colours
ggplot(dens, aes(x=x, y=y, color=Cluster_ID)) +  geom_point() + theme(legend.position="none") + scale_colour_viridis_d(option="H")
# QDC clustering with spatialWarsaw:: package
install.packages("devtools") # to install from Github
devtools::install_github("poktam/spatialWarsaw")
library(spatialWarsaw)

# QDC() automatically produces a table with the result
# By default it will also produce graphs like Fig.9
qdc<-QDC(firms.sf[selector,], 5000, nclust=3, k=10, eps=0.05)
head(qdc)   # it outputs points classification and graphics as above
# variable importance mentioned in the text
install.packages("flexclust")       # run it once only
install.packages("FeatureImpCluster")   # run it once only
library(flexclust)              # for as.kcca()
library(FeatureImpCluster)          # for FeatureImpCluster()
km.kcca<-as.kcca(km.set1, dens[,3:4])   # convert to kcca
FeatureImp_km<-FeatureImpCluster(km.kcca, as.data.table(dens[,3:4]))
plot(FeatureImp_km)             # show drivers of clustering

Figure 9: QDC (Quick Density Clustering) algorithm based on k-means: a) relation between the two spatial variables (x=total distance to k nearest neighbours, y=number of neighbours within a fixed radius), b) geographical location of detected clusters

# Fig.10 – Density clustering with Self-Organising Maps (SOM)

# SOM clustering (based on Fig.7a code)
# Spatial variables are the same as before, clusters are new
install.packages("kohonen")
library(kohonen)
som_grid<-somgrid(xdim=1, ydim=3, topo="hexagonal") # 3 clusters
som_model<-som(as.matrix(dens[,5:6]), grid=som_grid, rlen=500, alpha=c(0.05,0.01), keep.data=TRUE) # cluster into 3 clusters
dens$clusters_som<-som_model$unit.classif # cluster ID

# Scatterplot of both spatial variables, SOM clusters in colours
ggplot(dens, aes(x=Total_distance_to_k_nearest_neighbours, y=Number_of_points_around, color=clusters_som)) +  geom_point()+ theme(legend.position="none")

# Location map of points, SOM clusters in colours
ggplot(dens, aes(x=x, y=y, color=clusters_som)) +  geom_point() + theme(legend.position="none")

Figure 10: Clustering of spatial variables using Self-Organising Maps (SOM): a) relation between both spatial variables (x=total distance to k nearest neighbours, y=number of neighbours within a fixed radius), b) geographical location of detected clusters

# Fig.11 – Density clustering with hierarchical tree

# Hierarchical clustering (based on Fig.7a code)
dm<-dist(dens[,5:6]) # mutual distances between observations
hc<-hclust(dm, method="complete")   # simple dendrogram
clust<-cutree(hc, k=3)          # split into 3 clusters
dens$clusters_tree<-clust           # add cluster ID to data.frame

# Scatterplot of spatial variables, hierarchical clusters in colours
ggplot(dens, aes(x=Total_distance_to_k_nearest_neighbours, y=Number_of_points_around, color=clusters_tree)) +  geom_point()+ theme(legend.position="none") + scale_colour_viridis(option="A")

# Location map of points, hierarchical clusters in colours
ggplot(dens, aes(x=x, y=y, color=clusters_tree)) +  geom_point() + theme(legend.position="none") + scale_colour_viridis(option="A")

Figure 11: Clustering of spatial variables using hierarchical trees: a) relation between both spatial variables (x=total distance to k nearest neighbours, y=number of neighbours within a fixed radius), b) geographical location of detected clusters

# Tab.4 – Statistics of spatial variables in clusters 

# Displays the statistics of analysed variables in cluster groups
# based on codes from Fig.7,9,10,11
install.packages("psych") # run it once only
library(psych)
describeBy(dens[,5:6], dens$Cluster_ID) # k-means (column renamed above)
describeBy(dens[,5:6], dens$clusters_som)
describeBy(dens[,5:6], dens$clusters_tree)

# Rand Index for comparing the partitioning mentioned in the text
install.packages("clv") # run it once only
library(clv)
ext<-std.ext(as.integer(dens$Cluster_ID), as.vector(dens$clusters_som)) # k-means vs. SOM
clv.Rand(ext) # Rand index

Table 4: Parameters and composition of clusters of spatial variables (for n=5000 observations) visualised in Fig.12

# Fig.12 – Location of clusters (in different partitioning)
# based on codes from Fig.7,9,10,11

# k-means clusters – all points in grey, clusters in colour
# Fragment of code common to all figures
par(mar=c(1,1,1,1)) # narrow margins in the plot
plot(dens[,1:2], pch=".", main="cluster 1", col="grey70", axes=F)
plot(st_geometry(region), add=TRUE) # add contour

# Points in colour for specific cluster – please customise
# Select variable: Cluster_ID (k-means), clusters_som or clusters_tree
# Select cluster ID: 1,2,3
# Select colour: red, green, blue, pink
points(dens[dens$Cluster_ID==1,1:2], bg="red", pch=21, cex=1.2)

Figure 12: Location of QDC clusters from Tab.4 (knn=10, radius=5 km) using different clustering algorithms: a) k-means, b) SOM, c) hierarchical trees

2.3 DBSCAN and its extensions

2.3.1 DBSCAN in 2D

This code presents how the DBSCAN algorithm works. The first part of the code (Tab.5) shows how rasterisation with terra:: works. This helps to understand the density of the pattern and the magnitude of the nearest neighbourhood – information necessary for successful clustering with dbscan(), obtained by setting the appropriate parameters. The second part of the code (Fig.14 and Fig.15) runs DBSCAN with different parameters. The for() loop performs the calculations for a set of parameters.
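Besides rasterisation, a common companion diagnostic for choosing eps – not part of the book’s code – is the k-nearest-neighbour distance plot from dbscan::, where the ‘knee’ of the sorted-distance curve suggests a suitable radius:

library(dbscan)     # for kNNdistplot()
kNNdistplot(sub.firms[, crds.col], k=14)    # sorted distances to the 14th nn
abline(h=0.0157, lty=2)     # candidate eps at the knee of the curve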

# Tab.5 – Statistics of different raster divisions
# Run the code for samples and rasters of different size

# add vector of ones to count points with sum function
sub.firms$ones<-rep(1, times=dim(sub.firms)[1]) 

install.packages("terra") # run it once only
library(terra)      # for rasterize()
bb<-st_bbox(region)     # bounding box over the contour shape
bb              # extreme x and y coordinates
# building the raster – select 50, 100, 150, 200 cells per dimension
cells<-50           # number of cells in rows and columns
r<-rast(nrows=cells, ncols=cells, xmin=bb[1], ymin=bb[2], xmax=bb[3], ymax=bb[4]) # general raster, cells only 

crds.col<-9:10  # in which columns lon and lat are stored
rast.firms<-rasterize(as.matrix(sub.firms[, crds.col]), r, value=sub.firms$ones,  fun=sum) # number of points in each raster cell

plot(rast.firms, main="Firms density", mar=c(4,4,4,4)) 
plot(st_geometry(region), add=TRUE)

# extract the vector of counts  
rast.firms.vec50<-as.vector(rast.firms$sum)
summary(rast.firms.vec50)   # data to report in Tab.5

Table 5: Statistics on the number of points within raster cells of different sizes for a full sample of 1 million points and a sub-sample of 50,000 points

# Fig.14 – Alternative DBSCAN high density clusters

# DBSCAN with different parameters - please customise
# Select the dataset (or subset)
# Select eps – radius of ring
# Select minPts – minimum number of points in the ring

library(dbscan)
db<-dbscan(sub.firms[, crds.col], eps=0.0157, minPts=14) # Fig.14a
db<-dbscan(sub.firms[, crds.col], eps=0.0157, minPts=70) # Fig.14b

# simple plot
plot(sub.firms[, crds.col], col=db$cluster+1, pch=20)

Figure 14: Alternative DBSCAN clustering: a) for 1 million points, minPts=ln(n)=14 and eps=0.0157; b) for 50K points, minPts=70 and eps=0.0157

# the same plot with stronger control of colours
# clusterID as colour
greys<-paste0(rep("grey", times=length(db$cluster)), db$cluster)
# light grey for noise
plot(sub.firms[db$cluster==0, crds.col], col="grey90", pch=20)
# dark grey colours for high-density clusters
points(sub.firms[db$cluster!=0, crds.col], col=greys[db$cluster!=0], pch=20)

Figure 14 (grey tones): Alternative DBSCAN clustering: a) for 1 million points, minPts=ln(n)=14 and eps=0.0157; b) for 50K points, minPts=70 and eps=0.0157

# Fig.15 – Summary of alternative DBSCAN clusters (different minPts)

library(dbscan)
# output object for DBSCANs in the loop, for 30 iterations
dbs<-matrix(0, nrow=30, ncol=2) 
colnames(dbs)<-c("how.many.clusters", "noise.ratio")
size<-dim(sub.firms)[1] # number of observations in the clustered data set

# it may take long, make it faster by using a smaller sample
my.pts<-(1:30)*10 # vector of minPts for iterations: 10, 20, …, 300
for(i in 1:30){   # loop with 30 iterations
dbi<-dbscan(sub.firms[, crds.col], eps=0.0157, minPts=my.pts[i])
dbs[i,1]<-max(dbi$cluster) # max cluster ID – how many clusters
dbs[i,2]<-length(which(dbi$cluster==0))/size # noise ratio
}   # dbscan() returns cluster ID for each point, 0 is noise

# Plot of number of clusters in each iteration
plot(my.pts, dbs[,1], type="l", main="Number of clusters in DBSCAN", xlab="Minimum number of points in a radius (minPts)", ylab="Number of clusters")
points(my.pts, dbs[,1], pch=21, bg="darkmagenta")
abline(h=(1:10)*25, lty=3, col="grey80")

# Plot of noise ratio in each iteration
plot(my.pts, dbs[,2], type="l", main="Ratio of noise in DBSCAN", xlab="Minimum number of points in a radius (minPts)", ylab="Ratio of noise", ylim=c(0.2, 0.6))
points(my.pts, dbs[,2], pch=21, bg="darkmagenta")
abline(h=(1:12)*0.05, lty=3, col="grey80")

Figure 15: DBSCAN performance as a function of the minPts parameter: a) number of clusters, b) noise ratio

2.3.2 DBSCAN in 3D

This is the code for DBSCAN in 3D. It starts (Fig.16) by generating a point pattern in 3D – locations (x,y) that make up two dimensions (2D) and attached values that make up the 3rd dimension (z). It uses rnorm() to generate random values from the normal distribution and data.frame() to create a new data.frame object. In the second part (Fig.17) it runs DBSCAN on 2D and 3D normalised data to check the clustering outcomes.

# Fig.16 – Generate data for DBSCAN in 3D – xyz dimensions

# Generate three locational clusters
X1<-rnorm(10, mean=1, sd=0.1)   # locations X in cluster 1
Y1<-rnorm(10, mean=3, sd=0.1)   # locations Y in cluster 1
Z1<-rnorm(10, mean=1, sd=0.05)  # values in point – small circles

X2<-rnorm(10, mean=1, sd=0.18)  # locations X in cluster 2
Y2<-rnorm(10, mean=1, sd=0.18)  # locations Y in cluster 2
Z2a<-rnorm(5, mean=1, sd=0.05)  # values in point – small circles
Z2b<-rnorm(5, mean=5, sd=0.5)   # values in point – big circles

X3<-rnorm(10, mean=3, sd=0.20)  # locations X in cluster 3
Y3<-rnorm(10, mean=2, sd=0.20)  # locations Y in cluster 3
Z3<-rnorm(10, mean=5, sd=0.5)   # values in point – big circles

# Generate ‘noise’ points – not in the clusters
X4<-rnorm(10, mean=2, sd=0.99)  # locations X
Y4<-rnorm(10, mean=2, sd=0.99)  # locations Y
Z4a<-rnorm(5, mean=1, sd=0.05)  # values in point – small circles
Z4b<-rnorm(5, mean=5, sd=0.5)   # values in point – big circles

data<-data.frame(X=c(X1, X2, X3, X4), Y=c(Y1, Y2, Y3, Y4), Z=c(Z1, Z2a, Z2b, Z3, Z4a, Z4b)) # create a single data.frame
data    # display data.frame with three columns: X, Y, Z
plot(data[,1:2], cex=data[,3])  # plot: xy is location, z is size

Figure 16: Spatial setting for 3D DBSCAN

# Fig.17 – DBSCAN in 2D and 3D (based on code for Fig.16)

# DBSCAN in 2D (columns for X and Y) works well on both non-standardised and standardised data
# DBSCAN in 3D (columns for X, Y and Z) works well only on standardised data

library(dbscan) # for dbscan() function

# 2D DBSCAN – size and colour differs for noise and clustered data
# for noise – cluster==0 and points are smaller
db<-dbscan(data[,1:2], eps=0.3, minPts=5)
plot(data[,1:2], col=db$cluster+1, bg=db$cluster+1, pch=ifelse(db$cluster==0, 20, 21), cex=ifelse(db$cluster==0, 1.1, 1.7))

# 3D DBSCAN for normalised data – value and closeness clusters
library(clusterSim) # for normalization function
data.n<-data.Normalization(data, type="n1", normalization="column")

db<-dbscan(data.n[,1:3], eps=0.5, minPts=3) # on normalized data
plot(data.n[,1:2], col=db$cluster+1, bg=db$cluster+1, pch=ifelse(db$cluster==0, 20, 21), cex=ifelse(db$cluster==0, 1.2, data[,3]^0.5))

Figure 17: DBSCAN clustering: a) 2D data (location only), b) 3D data (location and values, all dimensions standardised)

2.3.3 Problems with DBSCAN and its extensions (Hierarchical DBSCAN, spatio-temporal DBSCAN)

This code uses the hdbscan() function from the dbscan:: package to prepare hierarchical DBSCAN clustering for a wide range of eps – note that the hdbscan() function uses only one parameter, minPts, while the other, eps, is included in the hierarchical clustering.

# Fig.18 –  Hierarchical DBSCAN (using dendrogram) for population

library(dbscan) # for hdbscan()

# Plot of a dendrogram, Fig.18a
hdb<-hdbscan(sub.popul, minPts=100) # Hierarchical DBSCAN
plot(hdb, gradient=c("yellow", "orange", "red", "blue"), show_flat=T) 

# Map of points, colours by clusters, Fig.18b
clust<-hdb$cluster*10
clust[clust==0]<-80
cols<-paste0("grey", clust)
plot(sub.popul, col=cols, bg=cols, pch=ifelse(hdb$cluster==0, 20, 21), cex=ifelse(hdb$cluster==0, 0.25, 0.75)) 

Figure 18: Optimal flat clustering solution from HDBSCAN: a) dendrogram of density clustering – possible partitioning based on radius eps; b) final partitioning

2.4 LOF, GLOSH and OPTICS as DBSCAN diagnostics

2.4.1 Local Outlier Factor (LOF)

This code is used to perform Local Outlier Factor (LOF) analysis. It generates a point pattern that replicates the assumed desired spatial distribution of points. An interesting approach is to start with a matrix of mutual distances specified by the user – deliberately, some points are close together, and some are much further apart. A method that allows switching between the distance matrix and the (x,y) locations of the points is Multidimensional Scaling (MDS), which is performed with the mds() command from the smacof:: package. The MDS input is a distance matrix, while the output is (x,y) coordinates.

# Fig.19 – Data used to calculate LOF (Local Outlier Factor)

# Predefined example matrix of mutual distances between points
my.dist<-matrix(c(0, 1.4, 2.1, 4.8, 14, 1.4, 0, 1.3, 5.1, 13.2, 2.1, 1.3, 0, 4.2, 12, 4.8, 5.1, 4.2, 0, 12.7, 14, 13.2, 12, 12.7, 0), nrow=5, ncol=5, byrow=TRUE, dimnames=list(c("A", "B", "C", "D", "E"), c("A", "B", "C", "D", "E")))

# Multidimensional scaling (MDS) to extract location from distances
# plot in 2D (ndim=2)
install.packages("smacof")
library(smacof)     # for mds()
my.mds<-mds(my.dist, ndim=2,  type="ratio")     # from smacof::
my.mds                              # see quality of fit
summary(my.mds) # see new coordinates & stress per point
my.mds$conf # coordinates of points 
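# Optional check: distances between the MDS coordinates should mirror
# my.dist – ratio MDS recovers the input distances up to a scale factor
round(as.matrix(dist(my.mds$conf)), 2)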

library(dbscan) # for LOF statistics
my.lof<-dbscan::lof(my.mds$conf, minPts=4)  # LOF; minPts=4 includes the point itself, i.e. k=3 neighbours

par(mar=c(2,2,1,1)) # margins of figure
plot(as.data.frame(my.mds$conf), pch=21, main="LOF (minPts=3), {dbscan}", xlim=c(-0.7, 1.3), ylim=c(-0.7, 1.3), bg="grey80")
points(my.mds$conf, cex=(my.lof-1)*2, pch=1, col="red") #red circle
abline(h=(-2:2)*0.5, lty=3, col="grey85")
abline(v=(-2:2)*0.5, lty=3, col="grey85")
text(my.mds$conf, labels=round(my.lof, 2), pos=3) # labels
text(my.mds$conf, labels=rownames(my.mds$conf), pos=1)

Figure 19: Example of spatial structure for calculating the Local Outlier Factor LOF: a) matrix of mutual distances between points, b) location of points with values of LOF

2.4.2 Global-Local Outlier Score from Hierarchies (GLOSH)

This code computes the point densities that underlie the GLOSH statistic, using the pointdensity() function from the dbscan:: package. The input is the locations generated from the MDS in Fig.19.

# Fig.21 – Calculation of individual point densities
# (based on code for Fig.19)
# Select the eps parameter (0.5 or 0.9), use dbscan:: package

my.crds<-my.mds$conf        # new object with point coordinates
library(dbscan)         # for pointdensity()
point.density<-pointdensity(my.crds, eps=0.9, type="density")
point.density

# xy location reflects mutual distances, point size reflects density
plot(my.crds, pch=19, main="Density (eps=0.9)", cex=point.density*5, ylim=c(-1, 1.5), xlim=c(-1, 1.5))
text(my.crds[,1], my.crds[,2]-0.1, labels=round(point.density,2))
draw.circle(my.crds[5,1], my.crds[5,2], radius=.9)  # ring
draw.circle(my.crds[4,1], my.crds[4,2], radius=.9)  # ring

Figure 21: Individual point densities: a) for radius eps=0.5, b) for radius eps=0.9

2.4.4 Empirical analysis of LOF, GLOSH and OPTICS diagnostics measures

This code uses empirical data to compare three outlier measures: LOF, GLOSH and OPTICS. In the first part (Tab.9), the code uses the lof() command from the dbscan:: package to find local outliers for the population data. In a for() loop, it finds LOF for different minPts parameters. It binds the separate results into a single object using the rbind() command. As the loop saves the results with the same name in each iteration, one should change the name of the outcome object according to the iteration run – this was done with assign(), which assigns a new name to an existing object, and paste0(), which pastes the fixed and changing elements of the name into a single name. There are three implementations of LOF statistics in R: in the dbscan::, DescTools:: and Rlof:: packages. They produce the same results. Only dbscan:: deals with large data in a reasonable way. When specifying the number of k nearest neighbours, dbscan:: includes the point itself, so to make the results comparable, in dbscan:: set minPts=k+1 compared to DescTools:: and Rlof::. In Fig.22 it calculates and maps LOF values for the selected minPts value for the population data, while for Fig.23, it does the same for two selected sectors from the firms’ dataset.
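A minimal sketch of this parameter correspondence – the Rlof:: call is an assumption and is left commented out:

# in dbscan::, minPts=k+1 corresponds to k neighbours elsewhere
lof.db<-dbscan::lof(sub.popul, minPts=51)   # k=50 + the point itself
# lof.alt<-Rlof::lof(as.matrix(sub.popul), k=50) # assumed equivalent call
summary(lof.db)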

# Tab.9 – LOF values for different minPts

size<-dim(sub.popul)[1] # number of obs. in population dataset
breaks=c(0, 0.9, 1, 1.1, 1.5, 2, 3, 5, 15)  # intervals for LOF
minPts=c(25, 50, 100, 200, 400, 800)        # minPts levels

for(i in 1:6){                      # loop for each minPts
my.lof<-dbscan::lof(sub.popul, minPts=minPts[i])
t1<-table(cut(my.lof, breaks=breaks))/size # percentage structure
assign(paste0("t",i), t1)}  # names of objects: t1,t2,t3,t4,t5,t6

tab<-rbind(t1,t2,t3,t4,t5,t6) # merge tables into one object
tab

Table 9: LOF values (% in the interval) for different minPts in a sample of 10,000 points for population data

# Fig.22 – LOF to detect outliers in the population point pattern

my.lof<-dbscan::lof(sub.popul, minPts=50) # LOF for each point
plot(sub.popul, pch=".", main="LOF (minPts=50, n=10 000 obs)", axes=FALSE, xlab="", ylab="")
points(sub.popul[my.lof>2,], pch=1, col="green")  #LOF>2
points(sub.popul[my.lof>3,], pch=21, bg="red") #LOF>3
legend("bottomleft", legend=c("LOF<2", "LOF>2", "LOF>3"), fill=c("black", "green", "red"), bty="n", cex=0.8)

# The same code with the viridis palette
library(viridis)
viri_cols<-viridis(4, option="A") # colour palette

my.lof<-dbscan::lof(sub.popul, minPts=50) # LOF for each point
plot(sub.popul, pch=".", main="LOF (minPts=50, n=10 000 obs)", axes=FALSE, xlab="", ylab="", col="grey30")
points(sub.popul[my.lof>2,], pch=1, col=viri_cols[1])   #LOF>2
points(sub.popul[my.lof>3,], pch=21, bg=viri_cols[2], cex=2)#LOF>3
legend("bottomleft", legend=c("LOF<2", "LOF>2", "LOF>3"), fill=c("grey70", viri_cols[1], viri_cols[2]), bty="n", cex=0.8)
plot(st_geometry(region), add=TRUE, border="grey70")

Figure 22: Outliers measured with LOF for population point patterns

# Fig.23: LOF for business location in different sectors

# data subset for a given sector (A or M)
sub<-firms[firms$SEK_PKD7=="M", ] # select A or M to make a subset

my.lof<-dbscan::lof(sub[,9:10], minPts=50)
summary(my.lof) 
par(mar=c(1,1,1,1))     # small margins of figure
plot(sub[,9:10], pch=".", main="LOF (minPts=50, sector A)", axes=FALSE, xlab="", ylab="")
points(sub[my.lof>2,9:10], pch=1, col="green")  # LOF>2
points(sub[my.lof>3,9:10], pch=21, bg="red")    # LOF>3
legend("bottomleft", legend=c("LOF<2", "LOF>2", "LOF>3"), fill=c("black", "green", "red"), bty="n", cex=0.8)
plot(st_geometry(region), add=TRUE, border="grey80") # contour

Figure 23: LOF for business location: a) sector A, b) sector M

The second part of the code (Fig.24) calculates the GLOSH statistic for population data using the glosh() command from the dbscan:: package. It also compares the GLOSH values calculated with different knn values to check the sensitivity of the measure to this parameter.

# Fig.24: GLOSH statistics for population

library(dbscan)
GL.10<-glosh(sub.popul, k=10) # knn=10, subset of population data

par(mar=c(1,1,2,1)) # small margins of figure
plot(sub.popul, pch=".", main="GLOSH for population (knn=10)
using geo-coordinates as input", axes=FALSE, xlab="", ylab="")
points(sub.popul[GL.10>0.90, ], pch=1, col="green") 
points(sub.popul[GL.10>0.95, ], pch=21, bg="red", cex=0.6) 
legend("bottomleft", legend=c("GLOSH<0.9", "GLOSH>0.9", "GLOSH>0.95"), fill=c("black", "green", "red"), bty="n", cex=0.8)
plot(st_geometry(region), add=TRUE, border="grey80")

# Calculating GLOSH for different knn for Fig.24b (population data)
GL.10<-glosh(sub.popul, k=10)   
GL.20<-glosh(sub.popul, k=20)   
GL.40<-glosh(sub.popul, k=40)   
GL.80<-glosh(sub.popul, k=80)   
GL.160<-glosh(sub.popul, k=160)     
GL.320<-glosh(sub.popul, k=320)     

par(mar=c(4,4,4,4)) # bigger margins in the figure
plot(density(GL.10), ylim=c(0,4.5), xlab="", ylab="", main="GLOSH for different knn")
lines(density(GL.20), lty=2)
lines(density(GL.40), lty=3)
lines(density(GL.80), lty=4)
lines(density(GL.160), lty=5)
lines(density(GL.320), lty=6)
legend("topleft", legend=c("knn=10", "knn=20", "knn=40", "knn=80", "knn=160", "knn=320"), lty=c(1,2,3,4,5,6), bty="n", cex=0.8)

# Frequencies of GLOSH by intervals (interpreted in the text)
a10<-table(cut(GL.10, breaks=(0:10)/10))/nn
a20<-table(cut(GL.20, breaks=(0:10)/10))/nn
a40<-table(cut(GL.40, breaks=(0:10)/10))/nn
a80<-table(cut(GL.80, breaks=(0:10)/10))/nn
a160<-table(cut(GL.160, breaks=(0:10)/10))/nn
a320<-table(cut(GL.320, breaks=(0:10)/10))/nn
glosh.all<-rbind(a10, a20, a40, a80, a160, a320)
glosh.all

Figure 24: GLOSH statistics for population data: a) GLOSH for knn=10, b) statistical distribution (frequency) of GLOSH for different knn

The third part of the code (Fig.25) calculates the point density for population data. The results are added to the sf object and plotted using the ggplot() command. Two versions are proposed: using ggplot() to plot from an sf object and from a data.frame object.

# Fig.25: Point density values for the population

# Point density  for population for Fig.25a and Fig.25b
library(dbscan)
f1<-pointdensity(sub.popul[,1:2], eps=0.5, type="frequency") 
f2<-pointdensity(sub.popul[,1:2], eps=0.025, type="frequency")

# add new columns to data.frame objects
sub.popul$f1<-f1        # for Fig.25a   
sub.popul$f2<-f2        # for Fig.25b

sub.popul.sf$f1<-f1 # add new columns to sf objects
sub.popul.sf$f2<-f2

# Plot from data.frame object
ggplot() + geom_sf(data=region, fill="lightgray") + geom_point(data=sub.popul, alpha=0.6, aes(x=x, y=y, color=f1)) + scale_colour_gradient(name="Point density eps=0.5", low="yellow", high="red") + labs(title="Point density values for population", x="Longitude", y="Latitude") + theme_minimal()

# Plot from sf object
ggplot(sub.popul.sf) + geom_sf(aes(color=f2)) + scale_color_viridis_c(name="Point density eps=0.025", option="D") + labs(title="Point density values for population", x="Longitude", y="Latitude") 

Figure 25: Point density values (frequency counts) for population data: a) for eps=0.5; b) for eps=0.025

In the fourth part of the code (Tab.10), the OPTICS algorithm (run with the optics() command from the dbscan:: package) is used together with the DBSCAN algorithm (run with the extractDBSCAN() and hullplot() commands from dbscan:: package). The example shows how sensitive both solutions are to the parameters used in the analysis.

# Tab.10: OPTICS and DBSCAN algorithms (for a full sample of the population)

library(dbscan) # for optics() and extractDBSCAN()

# for minPts=5 – top panel of Tab.10
res5<-optics(popul[,1:2], minPts=5) # columns 1:2 are xy coordinates
plot(res5, main="Reachability plot, minPts=5") # reachability plot

# use higher radius eps_cl=0.015 (horizontal line)
res.edb5<-extractDBSCAN(res5, eps_cl=0.015)
res.edb5                # printing output
plot(res.edb5)              # black is noise
hullplot(popul, res.edb5)   # plot the DBSCAN clusters

# using lower radius eps_cl=0.01 (horizontal line)
res.edb5<-extractDBSCAN(res5, eps_cl=0.01)
res.edb5                # printing output
plot(res.edb5)              # black is noise
hullplot(popul, res.edb5)   # plot the DBSCAN clusters

# for minPts=625            # bottom panel of Tab.10
res625<-optics(popul[,1:2], minPts=625)
plot(res625, main="Reachability plot, minPts=625")

# using higher radius eps_cl=0.14 (horizontal line)
res.edb625<-extractDBSCAN(res625, eps_cl=0.14)
res.edb625              # printing output
plot(res.edb625)        # black is noise
hullplot(popul, res.edb625)# plot the DBSCAN clusters

# using lower radius eps_cl=0.1 (horizontal line)
res.edb625<-extractDBSCAN(res625, eps_cl=0.1)
res.edb625              # printing output
plot(res.edb625)        # black is noise
hullplot(popul, res.edb625)# plot the DBSCAN clusters

Table 10: OPTICS and DBSCAN algorithms for 65,660 points representing spatial distribution of population

2.5 Density Peaks Clustering (DPC)

This code deals with Density Peaks Clustering (DPC). In its first part (Fig.27) it generates the 3D pattern (locations and values) for the DPC analysis. It uses rnorm() to generate random values (x,y,z) from the normal distribution. It creates a single object with data.frame(). The analysis (Fig.27 & Fig.28) includes calculating the 2D distance with the dist() command and using the densityClust:: package to perform DPC with densityClust(), plot the results with plotDensityClust() and search for clusters with findClusters(). The densityClust:: package works well only for small samples, while for large data (more than 1500 observations) it fails. However, it produces nice plots with the ggplot() approach.

# Fig.27: DPC (Density Peak Clustering) in 2D

# Create 2D and 3D dataset for DPC (2D: X and Y, 3D: Z)
set.seed(777)       # for reproducibility of analysis

# Create three spatial clusters
X1<-rnorm(10, mean=1, sd=0.1)   # Cluster 1
Y1<-rnorm(10, mean=3, sd=0.1)
Z1<-rnorm(10, mean=1, sd=0.05)

X2<-rnorm(10, mean=1, sd=0.18)  # Cluster 2
Y2<-rnorm(10, mean=1, sd=0.18)
Z2a<-rnorm(5, mean=1, sd=0.05)
Z2b<-rnorm(5, mean=5, sd=0.5)

X3<-rnorm(10, mean=3, sd=0.20)  # Cluster 3
Y3<-rnorm(10, mean=2, sd=0.20)
Z3<-rnorm(10, mean=5, sd=0.5)

# Generate random points outside the clusters
X4<-rnorm(10, mean=2, sd=0.99)
Y4<-rnorm(10, mean=2, sd=0.99)
Z4a<-rnorm(5, mean=1, sd=0.05)
Z4b<-rnorm(5, mean=5, sd=0.5)

data<-data.frame(X=c(X1, X2, X3, X4), Y=c(Y1, Y2, Y3, Y4), Z=c(Z1, Z2a, Z2b, Z3, Z4a, Z4b)) # Merge all data into a single object
head(data)  # 40 observations, 3 columns (X, Y, Z)
plot(data[,1:2], cex=data[,3], pch=21, bg="black")  # plot in 3D
plot(data[,1:2], pch=21, bg="black", cex=1.5) # plot in 2D, Fig.27a

install.packages("densityClust")
library(densityClust)

# Inputs for DPC
dd<-dist(data[,1:2])    # Euclidean distance for (x,y) points
dpc<-densityClust(dd, gaussian=FALSE)    # decision graph for automatic radius, returns ‘distance cutoff’
# distance cutoff calculated to 0.06262244 

# decision graph dg, Fig.27b
# x(rho) is the number of neighbours in the radius, 
# y(delta) is the distance to the peak point
plot(dpc) # simple graph with a white background
plotDensityClust(dpc, type="dg")  # graph with a grey background

Figure 27: Density Peaks Clustering in 2D: a) analysed point pattern; b) decision graph

# Fig.28: DPC (Density Peak Clustering) in 2D (based on Fig.27)

# Fig.28a – Gamma (gg) graph, gamma (on y) is delta * rho
# on x are the first few observations sorted by gamma (y)
# number of density peaks are the first few top points
plotDensityClust(dpc, type="gg", n=40) # n=40 points to consider

# Fig.28b Clustering with own min. parameters from decision graph
dpc.clust<-findClusters(dpc, rho=1.5, delta=1.5)     
dpc.clust   # clustering parameters and share of points in core
clustered(dpc.clust)    # check if points are clustered
plotDensityClust(dpc.clust, type="mds") # results: points and labels

Figure 28: Density Peaks Clustering: a) gamma graph; b) final clustering

In the 3D approach (Fig.29) the data were normalised before further steps with data.Normalization() from the clusterSim:: package, and 3D distances are calculated with dist() – this becomes the input for the previously applied DPC commands. The codes for Fig.30 apply the same analysis to population data – they use the adpclust() command from the ADPclust:: package. Graphs have been generated using plotting functions from the densityClust:: package and ggplot() commands.

# Fig.29: DPC (Density Peak Clustering) in 3D (continued from Fig.27 and Fig.28)

# Normalise the data – to adjust the scale of (x,y) and z
library(clusterSim)
data.n<-data.Normalization(data, type="n1", normalization="column") 

dd<-dist(data.n)
dpc<-densityClust(dd, gaussian=FALSE)    
plotDensityClust(dpc, type="dg") # decision graph, Fig.29a
plotDensityClust(dpc, type="gg") # gamma graph, Fig.29b

# Plot the original 3D pattern
ggplot(data.n, aes(x=X, y=Y))+ geom_point(aes(size=Z)) # Fig.29c

# MDS-reduced final clustering
dpc.clust<-findClusters(dpc, rho=1.5, delta=1.5)     # clustering
dpc.clust   # clustering result
plotDensityClust(dpc.clust, type="mds") # clusters & labels, Fig.29d

Figure 29: DPC for 3D mixed data: a) decision graph, b) gamma graph, c) original spatial distribution with point size, d) MDS reduced 3D to 2D distribution with clusters

# Fig.30: DPC for population (solution for larger datasets)

install.packages("ADPclust")
library(ADPclust)

selector.popul<-sample(1:dim(popul)[1], 1500, replace=FALSE)
sub.popul<-popul[selector.popul, ] # subsample

# Density clusters are selected interactively by the user
dpc.popul<-adpclust(sub.popul, centroids="user")

ggplot(sub.popul, aes(x=x, y=y)) + geom_point(shape=20, size=3, aes(color=dpc.popul$clusters)) + scale_color_viridis_c(name="DPC clusters") + labs(title="DPC clusters for population", x="Longitude", y="Latitude")

Figure 30: DPC for population data (sampled): a) decision graph with selection of cores, b) density clusters mapped

2.6 Spatial Scan (Kulldorff statistics)

The R code starts by reading areal NUTS4 / LAU1 poviat data – two variables, one serving as the reference for the other. The order of the data must match the order of the regions in the shapefile. The FlexScan:: package works with the sp class, therefore spatial objects have been converted from sf to sp with as_Spatial(). SpatScan statistics are easily computed using the flexscan() command, while the underlying data can be plotted using ggplot().

In the first part of the code (Tab.11 and Fig.31) the SpatScan statistic is calculated for regional data at the NTS4 level. In the second part of the code (Tab.12 and Fig.32), the SpatScan statistic is calculated for rasterised data. The rasters are created using the rast() command from the terra:: package, within the bounding box obtained with st_bbox(). The aggregation of variables into raster cells is done with rasterize(). As the SpatScan statistic requires the sp class, output from terra:: is first converted to the sf class and then to the sp class – there is no direct conversion from terra:: to sp::. This conversion uses the stars:: package. SpatScan is run with the rflexscan:: package and uses a spatial weight matrix indicating the neighbourhood structure, derived with the spdep:: package.

# Tab.11 and Fig.31: Spatial Scan (Kulldorff Statistics)

install.packages("FlexScan")
library(FlexScan)

# Counts of two distributions – firms and population
# 42 NTS4 subregions of Mazovia region - ordered as in a shapefile
# NTS4_data object with data was created in ‘starter’ section

# NTS4 map in sp class, converting NTS4 object created in ‘starter’
NTS4.sp<-as_Spatial(NTS4, cast=TRUE, IDs="jpt_kod_je") # sp class

# Tab.11 and Fig.31a - SpatScan for firms per capita
fs<-flexscan(NTS4.sp, NTS4_data$firms, NTS4_data$population)
fs

# Fig.31b - Visualise the ratio of ‘firms per capita’
# ratio: firms per inhabitant in NTS4 regions
NTS4$ratio<-NTS4_data$firms/NTS4_data$population

# plot figure
ggplot(data=NTS4) + geom_sf(aes(fill=ratio)) + scale_fill_viridis_c(option="plasma", guide=guide_colorbar(barwidth=15), name=NULL) + coord_sf(xlim=c(19, 23), ylim=c(51, 53.5), expand=TRUE) + theme(legend.position='bottom', legend.text=element_text(size=8), legend.title=element_text(size=5), legend.key.size=unit(0.2, "cm")) + labs(title="Ratio: firms per capita")

Figure 31: Visualisation of SpatScan: a) detected clusters, b) real data – risk distribution

# Tab.12 and Fig.32: SpatScan for rasterised data (20*20) & (50*50)

install.packages("stars")
install.packages("rflexscan") 
install.packages("spdep")

library(stars)      # for conversion using st_as_sf.stars()
library(terra)      # for rast() and rasterize()
library(rflexscan)  # for SpatScan statistic using rflexscan()
library(spdep)      # for neighbourhood matrix

nrows.raster<-50        # size of raster in rows, can be 20
ncols.raster<-50        # size of raster in columns, can be 20
nmax<-nrows.raster*ncols.raster # number of raster cells
bb<-st_bbox(NTS4)   # bounding box

# create raster r with user-defined number of rows and columns
r<-rast(nrows=nrows.raster, ncols=ncols.raster, xmin=bb[1], ymin=bb[2], xmax=bb[3], ymax=bb[4]) # with terra::

# population point data – rasterise and get a vector of values
popul$ones<-rep(1, times=dim(popul)[1]) # to count points with sum()
rast.popul<-rasterize(as.matrix(popul[,1:2]), r, value=popul$ones,  fun=sum) # it counts number of points (sum of ‘ones’) in raster r

plot(rast.popul, main="Population density", mar=c(4,4,4,4))
plot(st_geometry(region), add=TRUE) # add contours

rast.popul.vec<-as.vector(rast.popul$sum) # raster values as vector
rast.popul.vec[is.na(rast.popul.vec)==TRUE]<-0 # NA changed to 0

# firms point data – rasterise and get a vector of values
firms$ones<-rep(1, times=dim(firms)[1]) # column of ‘1’ values
rast.firms<-rasterize(as.matrix(firms[ ,crds.col]), r, value=firms$ones,  fun=sum) # counts number of points in raster r

plot(rast.firms, main="Firms density", mar=c(4,4,4,4)) 
plot(st_geometry(region), add=TRUE) # add contours

rast.firms.vec<-as.vector(rast.firms$sum) # raster values as vector
rast.firms.vec[is.na(rast.firms.vec)==TRUE]<-0 # NA changed to 0

# convert raster from terra class to sf class and to sp class
r.stars<-stars:::st_as_sf.stars(stars::st_as_stars(rast.popul), point=FALSE, merge=FALSE, connect8=TRUE, na.rm=FALSE) # class sf
r.sp<-as_Spatial(st_geometry(r.stars)) # class sp

# a matrix of centroids of rasters
my.xy<-st_coordinates(st_centroid(r.stars))

# expected number of firms per cell (based on general ratio)
expected<-rast.popul.vec * (sum(rast.firms.vec) / sum(rast.popul.vec))

# spatial neighbourhood nb matrix from raster in sp class 
nb<-poly2nb(r.sp) # with spdep::

# SpatScan statistic uses nb matrix created using poly2nb()
fs2<-rflexscan(x=my.xy[,1], y=my.xy[,2], name=1:nmax, observed=rast.firms.vec, expected=expected, nb=nb, stattype= "RESTRICTED") # stattype can be also "ORIGINAL" (method used)

fs2         # Tab.12 shows main information about clusters
summary(fs2)    # details of clusters
fs2$cluster[[1]]$area       # ids of rasters in given cluster
par(mar=c(1,1,1,1)) # narrow margins of figure
plot(r.sp, border="grey80") # plot of raster in sp class
plot(r.sp[fs2$cluster[[1]]$area], add=TRUE, col="red") # most likely cluster

for(i in 2:22){ # secondary significant, known from summary(fs2)
plot(r.sp[fs2$cluster[[i]]$area], add=TRUE, col="blue")}

for(i in 23:33){ # secondary insignificant, known from summary(fs2)
plot(r.sp[fs2$cluster[[i]]$area], add=TRUE, col="lightblue")}

plot(st_geometry(region), add=TRUE) # add contours
legend("bottomleft", c("most likely clusters", "secondary clusters (signif.)", "secondary clusters (insignif.)"), fill=c("red", "blue", "lightblue"), bty="n") # legend

Table 12: SpatScan for rasterised data (20*20) (as in Fig.32a)

Figure 32: High density SpatScan clusters of the number of firms related to a population for raster data: a) 400 raster cells (20*20 grid), b) 2500 raster cells (50*50 grid)

2.7 Voronoi clustering

Voronoi tessellation can be performed with the spatstat:: package on objects of the owin and ppp classes, or with the sf:: package on objects of the sfc class. In both approaches the coordinate system should be planar (e.g. Mercator) – therefore the codes include transformations of coordinates and projections. A contour map plays the role of a border that cuts (‘clips’) the lines dividing the area between the points. Areas of tessellation tiles can be calculated as nominal or as relative values (as a proportion of the total area). The final result in either scenario can be in the sf class, as converters in both directions between sf:: and spatstat:: are available. An owin class object, which serves as the contour for the analysis, can be created in two ways: from the extreme values (min and max) of the x and y coordinates, or from a contour of the region.
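A minimal sketch of these two options (with toy coordinates; the contour-based conversion mirrors the as.owin() calls used later in this document):

# two ways to create an owin window (illustrative sketch)
library(spatstat)
xy<-matrix(runif(40), ncol=2)   # toy planar coordinates (x,y)
w1<-owin(xrange=range(xy[,1]), yrange=range(xy[,2])) # from min/max
region.merc<-sf::st_transform(region, crs=3857) # planar projection
w2<-spatstat.geom::as.owin(sf::st_geometry(region.merc)) # from contour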

# Fig.33: Voronoi tessellation of point patterns

set.seed(130) # for reproducibility of random pattern
selected<-sample(1:dim(popul)[1], 20, replace=FALSE)
sub.popul<-popul[selected, ]    # subsample of population
par(mar=c(4,4,4,4))         # wider margins of figure
plot(sub.popul[,1:2], pch=21, bg="black", cex=1.2)  # Fig.33a

# Voronoi (Dirichlet) tessellation with spatstat::
install.packages("spatstat")
library(spatstat)
region.owin<-owin(c(min(sub.popul[,1])-0.35, max(sub.popul[,1])+0.35), c(min(sub.popul[,2])-0.35, max(sub.popul[,2])+0.35)) # a bit wider range of bounding box
region.ppp<-ppp(x=sub.popul[,1], y=sub.popul[,2], window=region.owin) # ppp class, based on owin class
plot(region.ppp, "Point pattern")   # plot from ppp class

region.tes<-dirichlet(region.ppp) # Voronoi / Dirichlet tessellation
par(mar=c(1,1,1,1))         # narrow margins of the figure

# Fig.33b – tessellated point pattern
plot(region.tes, main="Tessellated point pattern") # plot of tiles
points(sub.popul[,1:2], bg="black", pch=21) # points as centroids

# convert classes – from spatstat:: to sf::
tess.sf<-st_as_sfc(region.tes)
plot(st_geometry(tess.sf)) # plot equivalent to Fig.33b but in sf
points(sub.popul[,1:2], bg="black", pch=21) # points as centroids

Figure 33: Tessellation of a point pattern

# Fig.34: Density clustering based on Voronoi polygons for population data

# Data preparation
selected<-sample(1:dim(popul)[1], 1500, replace=FALSE)# 1500 or 5000
sub.popul<-popul[selected, 1:2] # (x,y) data only
colnames(sub.popul)<-c("x", "y")    # proper column names, just in case
points.sf<-st_as_sf(sub.popul, coords=c("x", "y"), crs=4326, agr="constant") # convert points from data.frame to sf class

# Convert from spherical WGS84 (current) to planar (Mercator) coordinates in sf::
# suffix p in object name stands for ‘planar’
region.p<-sf::st_transform(region, crs=3857)        # contour map
points.sf.p<-sf::st_transform(points.sf, crs=3857)  # point data

# Voronoi tessellation in sf:: (planar projections)
points.sfc<-st_geometry(points.sf.p)
box.sfc<-st_geometry(region.p)
points.union<-st_union(points.sfc)
tess<-st_voronoi(points.union, box.sfc)
tess.clip<-st_intersection(st_cast(tess), st_union(box.sfc))

# areas of tessellation tiles
a.abs<-st_area(tess.clip)   # absolute area of tessellation tiles
a.rel<-a.abs/sum(a.abs) # relative area (%) of tessellation tiles
a.exp<-1/length(a.rel)  # threshold (expected value)

# plot with plot() function
par(mar=c(2,2,2,2))                 # proper margins
brks<-c(0, a.exp, max(a.rel))           # thresholds for colours
cols<-c("darkred","bisque2", "bisque2") # colours
plot(st_geometry(tess.clip), col=cols[findInterval(a.rel, brks)], border="grey50")                  # plot of tiles in colours
legend("bottomleft", legend=c("dense", "not dense"), fill=cols, cex=0.8, bty="n")               # legend with labels
title(main="Density clustering based on Voronoi polygons, 
threshold set as expected area (%)")    # main title

# Data for plotting with ggplot()
tess.sf<-st_as_sf(tess.clip)    # from sfc to sf class
tess.sf$a.rel<-a.rel            # relative size of tiles
tess.sf$ad<-"dense"         # classification of tiles
tess.sf$ad[as.numeric(tess.sf$a.rel)>a.exp]<-"not dense"
ggplot() + geom_sf(tess.sf, mapping=aes(fill=ad))

Figure 34: Voronoi clustering for a sample of points: a) 1500 points, b) 5000 points

Codes for Chapter 3

3.1 SPAG – measure of Spatial Agglomeration

The SPAG measure is an advanced algorithm that produces simple and straightforward results. Many intermediate steps are necessary to obtain SPAG; therefore, only the automated version using the SPAG() command from the spatialWarsaw:: package is presented. In the first part of the code (Fig.35, Fig.36 and Fig.37), SPAG is tested on different point patterns generated with the st_sample() function. In the second part (Fig.38 and Fig.39) it is applied to empirical data – firm locations – for datasets of different sizes and by sectors, in a for() loop.

# Fig.35: SPAG for random distribution of points of different sizes

# Polygon object was created at the beginning of the code

# Point pattern with size value - random distribution, projected
points.random<-st_sample(polygon, 150, type="random") # 150 obs.
crds.random<-as.data.frame(st_coordinates(points.random))
crds.random$size<-sample(1:5, dim(crds.random)[1], replace=TRUE)
random.sf<-st_as_sf(crds.random, coords=c("X", "Y"), crs=4326) 

# Plot in ggplot() version – Fig.35a
ggplot(poly.sf) + geom_sf(fill=NA) + geom_sf(data=random.sf, aes(size=size)) + theme_minimal()+ labs(title="Random distribution") 

# Plot in plot() version – Fig.35a
plot(st_geometry(polygon), main="Random distribution")
plot(st_geometry(points.random), add=TRUE, cex=crds.random$size/2, pch=21, bg="black")
axis(1); axis(2)

library(spatialWarsaw)
my.spag<-SPAG(random.sf, "size", poly.sf)   # Fig.35b

Figure 35: SPAG for random distribution: a) random distribution, b) SPAG Index

# Fig.36: SPAG for regular distribution of points of different sizes

# Point pattern with size value - regular distribution, projected
points.regular<-st_sample(polygon, 150, type="regular")
crds.regular<-as.data.frame(st_coordinates(points.regular))
crds.regular$size<-sample(1:5, dim(crds.regular)[1], replace=TRUE)
regular.sf<-st_as_sf(crds.regular, coords=c("X", "Y"), crs=4326) 

# Plot in ggplot() version – Fig.36a
ggplot(poly.sf) + geom_sf(fill=NA) + geom_sf(data=regular.sf, aes(size=size)) + theme_minimal() + labs(title="Regular distribution") 

# Plot in plot() version – Fig.36a
par(mar=c(2,2,2,2))
plot(st_geometry(polygon), main="Regular distribution")
plot(st_geometry(points.regular), add=TRUE, cex=crds.regular$size/2, pch=21, bg="black")
axis(1); axis(2)

my.spag<-spatialWarsaw::SPAG(regular.sf, "size", poly.sf) # Fig.36b

Figure 36: SPAG for regular distribution: a) regular distribution, b) SPAG Index

# Fig.37: SPAG for clustered points of different sizes

points.clust<-data.frame(X=rnorm(150, mean=0.5, sd=0.1), Y=rnorm(150, mean=0.5, sd=0.1)) # generate clustered coordinates
points.clust$size<-sample(1:5, dim(points.clust)[1], replace=TRUE)
clust.sf<-st_as_sf(points.clust, coords=c("X", "Y"), crs=4326) 

# plot in ggplot() version – Fig.37a
ggplot(poly.sf) + geom_sf(fill=NA) + geom_sf(data=clust.sf, aes(size=size)) + theme_minimal() + labs(title="Clustered distribution") 

# plot in plot() version – Fig.37a
plot(st_geometry(polygon), main="Clustered distribution")
axis(1); axis(2)
plot(st_geometry(clust.sf), add=TRUE, cex=points.clust$size/2, pch=21, bg="black")

my.spag<-spatialWarsaw::SPAG(clust.sf, "size", poly.sf) # Fig.37b

Figure 37: SPAG for clustered distribution: a) clustered distribution, b) SPAG Index

# Fig.38 SPAG for different scenarios

library(spatialWarsaw)

# Fig.38a & Fig.38b
firms.sf$size<-round(firms.sf$empl,0) # all sectors
nn<-2000  # 50000 or 2000, set sample size
selector<-sample(1:dim(firms.sf)[1], nn, replace=FALSE)
my.spag<-SPAG(firms.sf[selector,], "size", region)

# Fig.38c & Fig.38d
nn<-50000 
selector<-sample(1:dim(firms.sf)[1], nn, replace=FALSE)
sub.sf<-firms.sf[selector, ]    # subsample first, then filter by sector

my.spag<-SPAG(sub.sf[sub.sf$SEK_PKD7=="A",], "size", region) # Sector A

my.spag<-SPAG(sub.sf[sub.sf$SEK_PKD7=="M",], "size", region) # Sector M

Figure 38: SPAG in alternative scenarios: a) for the sample of 2,000 observations and all sectors; b) for the sample of 50,000 observations and all sectors; c) for sector A (agriculture); d) for sector M (professional and scientific activity)

# Fig.39: SPAG in a loop (and its components by sector)

# Names of sectors for each iteration
sectors<-c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U")

# output object
SPAG_sectors<-matrix(0, nrow=21, ncol=5)
colnames(SPAG_sectors)<-c("SPAG", "i.cov", "i.dist", "i.overlap", "n.obs")
rownames(SPAG_sectors)<-sectors

for(i in 1:21){ 
my.spag<-SPAG(firms.sf[firms.sf$SEK_PKD7==sectors[i],], "size", region)
SPAG_sectors[i,1]<-my.spag$SPAG
SPAG_sectors[i,2]<-my.spag$i.coverage
SPAG_sectors[i,3]<-my.spag$i.distance
SPAG_sectors[i,4]<-my.spag$i.overlap
SPAG_sectors[i,5]<-my.spag$n.obs
}
SPAG_sectors # print output (here three lines only)

# Line plot of SPAG by sector – Fig.39a
plot(SPAG_sectors[,1], type="l", ylim=c(0,0.3))
points(SPAG_sectors[,1], pch=21, bg="grey35", cex=1.6)
text(1:21, SPAG_sectors[,1]+0.02, sectors)
abline(h=0:10*0.05, lty=3, col="grey60", lwd=2)
abline(h=0:30*0.01, lty=3, col="grey80")
legend(5, 0.3, legend=c("SPAG"), lty=c(1), lwd=c(1), pch=21, pt.bg=c("grey35"), bty="n")

# Line plot of SPAG components by sector – Fig.39b
plot(SPAG_sectors[,3], type="l", ylim=c(0,1))
points(SPAG_sectors[,3], pch=21, bg="grey80", cex=1.6)
lines(SPAG_sectors[,4])
points(SPAG_sectors[,4], pch=21, bg="grey50", cex=1.6)
text(1:21, SPAG_sectors[,3]+0.05, sectors)
text(1:21, SPAG_sectors[,4]-0.05, sectors)
abline(h=0:100*0.1, lty=3, col="grey60", lwd=2)
legend(5, 1, legend=c("distance index", "overlap index"), lty=c(1,1), lwd=c(1,1), pch=21, pt.bg=c("grey80", "grey50"), bty="n")

# SPAG for population (mentioned in the text)
popul$ones<-rep(1, times=dim(popul)[1])
spag.popul<-SPAG(popul, "ones", region)

Figure 39: SPAG index and its components by sector: a) SPAG index, b) components of SPAG: distance and overlap indices

3.2 Entropy of density to measure agglomeration

3.2.1 Clark-Evans test

The Clark-Evans test is a very powerful and intuitive solution for agglomeration analysis. It is therefore worth presenting the operational details – to illustrate how it works. The first part of the code (Fig.40) generates different point patterns with only a few observations and shows how the elements of this test are calculated. The second part of the code (Fig.41) applies the Clark-Evans test to empirical data – by sector, which makes the results comparable to SPAG (Fig.39). The Clark-Evans test is available in spatstat:: and this solution is presented for Fig.41. However, as the test is based on the distance to the first nearest neighbour, the illustrative solution in Fig.40 uses the dbscan:: package and the kNN() function to compute this distance.

# Fig.40: Clark-Evans test for different theoretical patterns 

library(dbscan)
# Polygon object was created at the beginning of code

#Fig.40a - random distribution
points.random<-st_sample(polygon, 15, type="random") # sampling
plot(st_geometry(polygon))  # plot of contour
plot(st_geometry(points.random), add=TRUE) # plot of sampled points

# find the first nearest neighbour and distance to this point
pts.random<-as.data.frame(st_coordinates(points.random)) # xy crds
knn.pts.random<-dbscan::kNN(as.matrix(pts.random),1) # find knn=1
m.dist.rand<-mean(knn.pts.random$dist) # mean distance to knn=1

pts.random$knn<-as.vector(knn.pts.random$id[,1]) # save new variable
pts.random$dist<-as.vector(knn.pts.random$dist[,1])
pts.random

# Plot points and lines to the nearest point
plot(pts.random[,1:2], main="Random point pattern", ylim=c(0,1), xlim=c(0,1), pch=21, bg="black") # plot of points
for(i in 1:dim(pts.random)[1]){  # loop for all points to draw lines
lines(pts.random[c(i, pts.random$knn[i]),1], pts.random[c(i,pts.random$knn[i]),2])} # from (X0,X1) to (Y0,Y1)
text(0.3, 0, paste0("Average distance to knn=1 equals ", round(m.dist.rand,2)))

#Fig.40b - regular distribution
points.regular<-st_sample(polygon, 15, type="regular") # sampling
plot(st_geometry(polygon))           # plot of contour
plot(st_geometry(points.regular), add=TRUE) # plot of sampled points

pts.regular<-as.data.frame(st_coordinates(points.regular))
knn.pts.regular<-kNN(as.matrix(pts.regular),1) # find knn=1
m.dist.reg<-mean(knn.pts.regular$dist) # average distance to knn=1

pts.regular$knn<-as.vector(knn.pts.regular$id[,1]) # save variable
pts.regular$dist<-as.vector(knn.pts.regular$dist[,1])
pts.regular
plot(pts.regular[,1:2], main="Regular point pattern", ylim=c(0,1), xlim=c(0,1), pch=21, bg="black") # plot of points
for(i in 1:dim(pts.regular)[1]){ # loop to draw lines
lines(pts.regular[c(i, pts.regular$knn[i]),1], pts.regular[c(i, pts.regular$knn[i]),2])} # from (X0,X1) to (Y0,Y1)
text(0.3, 0, paste0("Average distance to knn=1 equals ", round(m.dist.reg,2)))

#Fig.40c - clustered distribution
pts.clust<-data.frame(X=rnorm(15, mean=0.5, sd=0.1), Y=rnorm(15, mean=0.5, sd=0.1)) # generate clustered points
plot(st_geometry(polygon)) # plot contour
points(pts.clust[,1:2], col="red") # plot points in red

knn.pts.clust<-kNN(as.matrix(pts.clust),1) # find knn=1
m.dist.clust<-mean(knn.pts.clust$dist) # mean distance to knn=1

pts.clust$knn<-as.vector(knn.pts.clust$id[,1]) # save variable
pts.clust$dist<-as.vector(knn.pts.clust$dist[,1])
pts.clust
plot(pts.clust[,1:2], main="Clustered point pattern", ylim=c(0,1), xlim=c(0,1), pch=21, bg="black") # plot points
for(i in 1:dim(pts.clust)[1]){ # loop to draw lines between points
lines(pts.clust[c(i, pts.clust$knn[i]),1], pts.clust[c(i, pts.clust$knn[i]),2])} # from (X0,X1) to (Y0,Y1)
text(0.3, 0, paste0("Average distance to knn=1 equals ", round(m.dist.clust,2)))

#Fig.40d - hexagonal distribution
points.hex<-st_sample(polygon, 15, type="hexagonal") # sampling
plot(st_geometry(polygon)) # plot contour
plot(st_geometry(points.hex), add=TRUE) # plot points

pts.hex<-as.data.frame(st_coordinates(points.hex)) # coordinates
knn.pts.hex<-kNN(as.matrix(pts.hex),1) # find knn=1
m.dist.hex<-mean(knn.pts.hex$dist) # mean distance to knn=1

pts.hex$knn<-as.vector(knn.pts.hex$id[,1]) # save a new variable
pts.hex$dist<-as.vector(knn.pts.hex$dist[,1])
pts.hex
plot(pts.hex[,1:2], main="Hexagonal point pattern", ylim=c(0,1), xlim=c(0,1), pch=21, bg="black") # plot points
for(i in 1:dim(pts.hex)[1]){ # loop to plot lines between points
lines(pts.hex[c(i, pts.hex$knn[i]),1], pts.hex[c(i, pts.hex$knn[i]),2])} # from (X0,X1) to (Y0,Y1)
text(0.3, 0, paste0("Average distance to knn=1 equals ", round(m.dist.hex,2)))

# Proportion of average distances between different distributions
m.dist.clust/m.dist.rand    # in clustered to random pattern
m.dist.reg/m.dist.rand  # in regular to random pattern
m.dist.hex/m.dist.rand  # in hexagonal to random pattern

Figure 40: Average distance measures to the first nearest neighbour in different point patterns: a) random, b) regular, c) clustered, d) hexagonal
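These average distances can be turned into the Clark-Evans statistic itself. A minimal sketch, assuming the standard CSR expectation 1/(2*sqrt(lambda)) for the mean nearest-neighbour distance and ~15 points in the unit box:

# Clark-Evans R = observed mean distance to knn=1 / expected in CSR
lambda<-15/1                # intensity: ~15 points in the unit box
d.exp<-1/(2*sqrt(lambda))   # expected mean nn distance under CSR
m.dist.rand/d.exp   # R close to 1 for the random pattern
m.dist.reg/d.exp    # R above 1 for the regular pattern
m.dist.clust/d.exp  # R below 1 for the clustered pattern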

# Fig.41: Clark-Evans test for different sectors

library(spatstat) # spatstat:: includes spatstat.geom::

# convert contour from sf to spatstat::, first planar projections
region.merc<-sf::st_transform(sf::st_as_sf(region), crs=sf::st_crs(3857))
region.owin<-spatstat.geom::as.owin(sf::st_geometry(region.merc))

# Convert the data to spatstat:: (planar projections required)
nn<-50000   # subsample size
selector<-sample(1:dim(firms)[1], nn, replace=FALSE) # subsample
firms.sf<-st_as_sf(firms[selector,], coords=c("lon", "lat"), crs=4326) # WGS84 spherical projections (as Google), sf class
firms.sf2<-sf::st_transform(sf::st_as_sf(firms.sf), crs=sf::st_crs(3857)) # planar projections (Mercator), sf class
firms.ppp<-ppp(x=st_coordinates(firms.sf2)[,1], y=st_coordinates(firms.sf2)[,2],window=region.owin, marks=as.factor(firms.sf2$SEK_PKD7)) # firms in ppp class 
ce<-clarkevans.test(firms.ppp) # Clark-Evans test for all data

# Output objects for loop
tab<-table(firms.ppp$marks)     # table with unique names of sectors
sectors<-names(tab)         # names of sectors
no.of.sectors<-length(sectors)  # number of sectors
ce.tab<-matrix(0, nrow=no.of.sectors, ncol=2) # empty output matrix

for(i in 1:no.of.sectors){      # loop through sectors
firms.sel<-firms.ppp[firms.ppp$marks==sectors[i],] # sectoral subset
ce<-clarkevans.test(firms.sel)  # Clark-Evans test for subsample
ce.tab[i,2]<-ce$statistic       # save test value
ce.tab[i,1]<-sectors[i]}        # save sector name

ce.tab<-as.data.frame(ce.tab)   # format data as data.frame
ce.tab[,2]<-as.numeric(as.character(ce.tab[,2])) # convert

# Plot the Clark-Evans test by sector
plot(ce.tab[,2], type="l", ylim=c(0,1), axes=FALSE, xlab="", ylab="") # line plot
points(ce.tab[,2], pch=21, bg="grey80", cex=1.6) # add points 
axis(1, at=1:no.of.sectors, labels=sectors, cex.axis=0.7)   # x axis
axis(2)                                     # y axis
text(1:no.of.sectors, ce.tab[,2]+0.05, sectors) # add text next to the points
abline(h=0:10*0.1, lty=3, col="grey80") # horizontal lines
title(main="The Clark-Evans test by sectors") # title

Figure 41: The Clark-Evans test for sectoral subsamples of firms

3.2.2 ETA (Entropy-Tessellation-Agglomeration) measure of agglomeration

The ETA statistic (Entropy-Tessellation-Agglomeration) is presented as a step-by-step solution (Fig.42) for simulated data, and as an automated solution using the ETA() command from the spatialWarsaw:: package (Fig.43) for empirical data. For Fig.42 it involves generating point patterns (random, regular, hexagonal, clustered) with st_sample() from the sf:: package, then performing the Voronoi tessellation on this pattern within fixed boundaries (with st_voronoi()), and finally calculating the entropy, where the inputs are the proportions of the tessellation tiles in the whole area. ETA() from spatialWarsaw:: (Fig.43) is based on the same code as presented for Fig.42.

# Fig.42: ETA (Entropy-Tessellation-Agglomeration) for pure point patterns (simulated data)

# Generate point patterns – please run separately for each pattern
points.random<-st_sample(polygon, 15, type="random") # random points
points.sfc<-st_geometry(points.random)

points.regular<-st_sample(polygon, 15, type="regular") # regular
points.sfc<-st_geometry(points.regular)

points.clust<-data.frame(X=rnorm(15, mean=0.5, sd=0.07), Y=rnorm(15, mean=0.5, sd=0.07)) # data.frame class, clustered pattern
points.sf<-st_as_sf(points.clust, coords=c("X", "Y")) # sf class
points.sfc<-st_geometry(points.sf)

points.hexagonal<-st_sample(polygon, 15, type="hexagonal") # hexagon
points.sfc<-st_geometry(points.hexagonal)
# Plot with ggplot() – plot it for each point pattern separately
ggplot(polygon) + geom_sf(fill=NA) + geom_sf(data=points.sfc) + theme_minimal() + labs(title="Point pattern – spatial distribution") 

# Plot with plot() – plot it for each point pattern separately
plot(st_geometry(polygon), axes=TRUE, main="Point pattern")
plot(st_geometry(points.sfc), add=TRUE, bg="black", pch=21) 

# Convert of objects and tessellations in sf:: - for each pattern
box.sfc<-st_geometry(polygon)
points.union<-st_union(points.sfc)
tess<-st_voronoi(points.union, box.sfc)
tess.clip<-st_intersection(st_cast(tess), st_union(box.sfc))

# Nominal and relative areas of tiles as input to entropy
a.nom<-st_area(tess.clip)           # nominal area of tiles
a.rel<-a.nom/sum(a.nom)         # relative area of tiles
ent.nom<-sum(-1*a.rel*log(a.rel))   # nominal entropy H
n<-length(a.nom)                    # number of tiles
ent.max<-log(1/n)*(-1)          # maximum entropy Hmax
ent.rel<-ent.nom/ent.max            # relative entropy Hrel

# Tessellation plot (with points in blue) – Fig.42(a,b,c,d)
par(mar=c(4,4,4,4))             # proper margins
plot(st_geometry(tess.clip), main="Degree of agglomeration
entropy of tessellated point pattern", sub="Relative H entropy =1 for spatially uniform distribution (no agglomeration)", axes=TRUE)
plot(st_geometry(points.sfc), add=TRUE, bg="darkblue", pch=21) 
legend(0.001, 0.99, horiz=FALSE, c(paste("Shannon H =", round(ent.nom,2)), paste("Relative H =", round(ent.rel,2)), paste("N =", round(n,2))), cex=0.75, bty="n")

Figure 42: Entropy-Tessellation-Agglomeration (ETA) values for simulated point patterns: a) random, b) regular, c) clustered, d) hexagonal

# Fig.43: ETA (Entropy-Tessellation-Agglomeration) for empirical data (different sample size, n=500 and n=5000)

library(spatialWarsaw)
eta.500<-ETA(firms.sf, region, 500) # Fig.43a for 500 obs.
eta.5000<-ETA(firms.sf, region, 5000) # Fig.43b for 5000 obs.

Figure 43: ETA (Entropy-Tessellation-Agglomeration) values for empirical business location data: a) sample size n=500, b) sample size n=5000

# Fig.44 ETA by sectors in a loop

sectors<-c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U")

eta.sectors<-matrix(0, nrow=21, ncol=1) # object for the loop

for(i in 1:21){             # start of the loop
sec<-sectors[i]             # select sector name
my.eta<-ETA(firms.sf[firms.sf$SEK_PKD7==sec,], region, 500)
eta.sectors[i]<-my.eta$H_ent}   # save results

# Plot ETA by sectors
plot(eta.sectors[,1], type="l", ylim=c(0.3,1), axes=FALSE, xlab="", ylab="") # line chart
points(eta.sectors[,1], pch=21, bg="deeppink3", cex=1.6) # points
axis(1, at=1:21, labels=sectors, cex.axis=0.7)  # x axis
axis(2)                         # y axis
text(1:21, eta.sectors[,1]+0.05, sectors)   # text next to points
abline(h=0:10*0.1, lty=3, col="grey80") # horizontal lines
title(main="ETA by sectors")            # title

Figure 44: Degree of agglomeration of sectors measured with ETA

3.2.3 Focal Local Entropy (FLE)

The Focal Local Entropy concept is presented as a step-by-step solution (Fig.45) and as an automated solution using the FLE() command from the spatialWarsaw:: package (Fig.46) – both examples use the empirical data. It involves rasterising the point pattern (counting points in raster cells) with rast() and rasterize() from terra::, normalising the result with scale(), finding for each cell (with a moving-window mechanism) its closest neighbouring cells with focal() from terra::, and calculating the entropy for these cells with a custom function created with the function() command. The entropy function has fixed intervals – due to normalisation they express standard deviations (after normalisation the standard deviation is 1 and the mean is 0). The intervals may cover a wide range of observations, as the breaks run from -100 to +100 standard deviations, with a focus on the most likely range from -3 to +3, with a step of 1 or 0.5 standard deviations. The table() function organises the observations from the focal neighbourhood into these intervals – it counts how many cells fall into a given interval and returns relative values (nominal counts divided by the total count).

# Fig.45: FLE (Focal Local Entropy) for squared moving window – step by step 

# Create a raster
library(terra)      # for rast() to rasterise data
nrows.raster<-50        # size of raster in rows
ncols.raster<-50        # size of raster in columns
bb<-st_bbox(region)     # bounding box of contour in sf class
r<-rast(nrows=nrows.raster, ncols=ncols.raster, xmin=bb[1], ymin=bb[2], xmax=bb[3], ymax=bb[4]) # raster with terra::

# entropy function for normalised variable
# pi object counts proportions of obs. in given intervals
breaks<-c(-100, -5, -3, -2, -1, -0.5, 0, 0.5, 1, 2, 3, 5, 100)
entropy.f<-function(y) {pi=table(cut(y, breaks=breaks))/sum(table(cut(y, breaks=breaks))); -sum(pi[pi>0]*log(pi[pi>0]))}
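
# quick illustrative check of entropy.f (not part of the book code):
# values spread over many intervals give positive entropy,
# values falling into a single interval give entropy equal to 0
entropy.f(seq(-3, 3, length.out=49)) # diverse values - high entropy
entropy.f(rep(0.25, 49))             # one interval only - entropy 0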

# prepare population data
popul$ones<-rep(1, times=dim(popul)[1]) # values to sum up
rast.popul<-rasterize(as.matrix(popul[,1:2]), r, value=popul$ones,  fun=sum) # we use crds as matrix and raster as aggregating box 
rast.popul.vec<-as.vector(rast.popul$sum) # save output as vector
rast.popul.vec[is.na(rast.popul.vec)==TRUE]<-0 # NA changed to 0

# Normalise values in a raster
rp<-r                           # additional object
values(rp)<-scale(rast.popul.vec)   # normalise values
fle<-focal(rp, w=3, fun=entropy.f)  # focal function, option w=3

# Fig.45a
library(viridis)
viri_cols<-viridis(n=10, option="F")
plot(fle, main="Focal local entropy, w=3", col=rev(viri_cols), colNA="white") # rev() is for inverted order of colours
plot(st_geometry(region), add=TRUE)

# Fig.45b
fle<-focal(rp, w=7, fun=entropy.f)  # focal function, option w=7
plot(fle, main="Focal local entropy, w=7", col=rev(viri_cols), colNA="white") # rev() is for inverted order of colours
plot(st_geometry(region), add=TRUE)
# automated solution with spatialWarsaw:: - for Fig.45b
library(spatialWarsaw)
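# sub.popul.sf – sf version of the population subsample; if not yet
# created, it can be built from sub.popul (x,y columns) as follows:
sub.popul.sf<-st_as_sf(sub.popul, coords=c("x","y"), crs=4326)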
my.FLE<-FLE(sub.popul.sf, region, nrows.raster=50, ncols.raster=50, w=7) # the same solution but with spatialWarsaw::

Figure 45: Focal local entropy of population for square moving window: a) size 3*3, b) size 7*7

# Fig.46: FLE (Focal Local Entropy) for radial moving window – automated

library(spatialWarsaw)
my.FLE<-FLE(sub.popul.sf, region, nrows.raster=50, ncols.raster=50, r=0.1)  # radius r=0.1 (~10km), Fig.46a
my.FLE<-FLE(sub.popul.sf, region, nrows.raster=50, ncols.raster=50, r=0.2)  # radius r=0.2 (~20km), Fig.46b

Figure 46: Focal local entropy of population for radial moving window: a) radius 0.1 (~10 km), b) radius 0.2 (~20 km)

3.3 Entropy and mutual information of density clusters in understanding business structure

The first part of the code (Fig.47) generates all the graphical elements that explain the concept of mutual information. It is intended to illustrate the composition of geometric shapes inside the clusters (inside the rings).

# Fig.47: Generating clustering scenarios for NMI
library(plotrix) # to draw a circle with draw.circle()

par(mar=c(1,1,1,1))         # narrow margins

# left plot, Fig.47a
plot(st_geometry(polygon))  # plot the box
#axis(1); axis(2)           # plot the axes (here suspended)

draw.circle(0.26,0.7, radius=0.25, lwd=2) # rings for clusters
draw.circle(0.74,0.3, radius=0.25, lwd=2)

# draw individual elements: pch=15 is square, pch=16 is circle, pch=17 is triangle

# locate geometric elements in (x,y)
points(c(0.2, 0.25, 0.30), c(0.80, 0.85, 0.90), pch=16, cex=1.5)
points(c(0.2, 0.25, 0.30, 0.21, 0.26, 0.31)+0.1, c(0.50, 0.55, 0.6, 0.46, 0.51, 0.56)+0.1, pch=17, cex=2.5)
points(c(0.7, 0.75, 0.80, 0.72, 0.77), c(0.20, 0.25, 0.3, 0.29, 0.35), pch=15, cex=2.5)
text(0.8, 0.95, "Scenario 1")

# right plot, Fig.47b
plot(st_geometry(polygon))  # plot the box
#axis(1); axis(2)           # plot the axes (here suspended)

# locate geometric elements in (x,y)
draw.circle(0.30,0.7, radius=.25, lwd=2)
draw.circle(0.74,0.3, radius=.25, lwd=2)

points(c(0.2, 0.25), c(0.80, 0.85), pch=16, cex=1.5)
points(c(0.2, 0.25), c(0.60, 0.65), pch=15, cex=1.5)
points(c(0.2, 0.25, 0.30, 0.22, 0.27)+0.2, c(0.50, 0.55, 0.6, 0.46, 0.51)+0.2, pch=17, cex=2.5)

points(c(0.85, 0.9, 0.95), c(0.20, 0.25, 0.3), pch=15, cex=1.5)
points(c(0.6), c(0.20), pch=16, cex=1.5)
points(c(0.85, 0.9, 0.95)-0.15, c(0.20, 0.25, 0.3)+0.2, pch=17, cex=2.5)
text(0.8, 0.95, "Scenario 2")

Figure 47: Clustering scenarios for NMI (Normalised Mutual Information)

The second part of this code (Tab.13) presents an automated solution for the mutual information statistic using the NMI() function from the aricode:: package. It is important that almost all variables used in the analysis are dummies – with values of zero and one only (the exceptions are “PKD7” and “SEK_PKD7”, which are multi-category variables). The analysis runs in a double for() loop to check all possible pairs from these sets of variables.
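A toy illustration (not part of the book code) of what NMI() returns for such variables – the measure depends on the partition, not on the labels:

library(aricode)
a<-c(0,0,1,1,0,1)          # a dummy variable
b<-c(1,1,0,0,1,0)          # the same split with flipped labels
NMI(a, b)                  # equals 1 - identical partitions
NMI(a, c(0,1,0,1,0,1))     # below 1 - a different partition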

# Tab.13: Normalised Mutual Information (NMI) values for density clusters and relative peripherality clusters 

# Variables used in the analysis
vec.cols<-c("COREfirms", "COREpopul", "dist_core_10", "dist_core_25", "dist_core_50", "dist_midsize_10", "dist_midsize_25", "dist_midsize_50", "dist_regional_10", "dist_regional_25", "dist_regional_50", "dist_localbig_10", "dist_localbig_25", "dist_localbig_50", "dist_localsmall_10", "dist_localsmall_25", "dist_localsmall_50") # 17 var. in columns

vec.rows<-c("dummy_ifbig", "dummy_prod", "dummy_constr", "if_hightech", "PKD7", "SEK_PKD7", "dummy_serv", "dummy_agri", "COREpopul") # 9 variables in rows

# Prepare output object to execute the loop
nmi<-matrix(0, nrow=9, ncol=17)
rownames(nmi)<-vec.rows # names of rows
colnames(nmi)<-vec.cols # names of columns

install.packages("aricode")
library(aricode)

for(i in 1:17){     # loop iterating by columns
for(j in 1:9){      # loop iterating by rows
nmi[j,i]<-NMI(firms[,vec.cols[i]], firms[,vec.rows[j]]) # NMI
}}              # end of double loop
round(nmi,4)        # print results, round the displayed values

Table 13: Normalised Mutual Information (NMI) values for density clusters and relative peripherality clusters

3.4 Kernel Density Estimation (KDE)

3.4.1 Traditional Kernel Density Estimation (KDE)

Kernel estimation is based on a weighting function that determines how strong the influence of near and far observations is. Kernels can be visualised as density functions. Fig.48 compares the kernels available in R, extracting the options included in the density() command from stats::. One of the KDE solutions available in R is the density() function in spatstat:: (note that this is not the same function as density() from stats::). Its output is in the (x,y,z) format, which can be plotted well with image.plot() from the fields:: package. The bounding box for the analysis is in the owin class – to keep results comparable it should be calculated from the full sample; it is calculated for Fig.49 and used in further code. In the case of Fig.49 and Fig.50, plotting the contour of the region first makes the figure larger and clearer, with smaller borders, no colour bar and no automatic title. As the area of the density plot overlaps the contour of the region, the contour must be plotted again. Skipping the first line with the contour plot gives larger white borders, a smaller density plot, a visible colour bar on the right and an automatic title at the top.
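As a quick one-dimensional illustration of how the bandwidth drives the smoothing (a toy example with density() from stats::, not part of the book code):

x<-c(rnorm(100, 0), rnorm(100, 5))  # toy bimodal data
plot(density(x, bw=0.3), main="Bandwidth and smoothing") # two modes
lines(density(x, bw=2), col="red")  # oversmoothed - single mode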

# Fig.48: Comparison of kernels
# modified example from density() function 
kernels<-eval(formals(density.default)$kernel)
# there are the following kernels: 
# [1] "gaussian"     "epanechnikov" "rectangular"  "triangular"   "biweight"    
# [6] "cosine"       "optcosine"   

par(mar=c(2,2,1,1)) # wider bottom and left, narrower top and right
plot(density(0, bw=1), xlab="", main="") # first kernel
for(i in 2:length(kernels))     # remaining kernels
lines(density(0, bw=1, kernel=kernels[i]), col=i, lwd=2)
legend(1.45, 0.4, legend=kernels, col=seq(kernels), lty=1, cex=0.75, y.intersp=1, bty="n") # legend with kernel names

Figure 48: Comparison of different kernels

# Fig.49 & Fig.50: KDE for the same pattern but with different bandwidths

# Sample of firms
selector<-sample(1:dim(firms)[1], 50000, replace=FALSE)
sub.firms<-firms[selector, ] # subset of firms

# colours from viridis palette
library(viridis)
viri_cols<-rev(viridis(n=20, option="F")) # colours in reverse order

# spatstat:: objects to calculate KDE
library(spatstat)   # to compute KDE with density() 

# Objects in spatstat class (owin and ppp)
obs_window<-owin(xrange=range(firms$lon), yrange=range(firms$lat))
firms.ppp<-ppp(x=sub.firms$lon, y=sub.firms$lat, window=obs_window)

# Kernel density estimation (KDE) – default bandwidth - Fig.49a
firms.kde<-density(firms.ppp, edge=T, kernel="gaussian")
plot(st_geometry(region), border="grey60")
plot(firms.kde, col=viri_cols, add=TRUE) 
plot(st_geometry(region), add=TRUE, border="grey60")

# KDE with arbitrary set bandwidth – Fig.49b
firms.kde<-density(firms.ppp, edge=T, kernel="gaussian", sigma=0.05) # sigma is the bandwidth
plot(st_geometry(region), border="grey60")
plot(firms.kde, col=viri_cols, add=TRUE)
plot(st_geometry(region), add=TRUE, border="grey60")

Figure 49: Spatial KDE (Kernel Density Estimation) for the same pattern and different bandwidths: a) default bandwidth, b) narrow bandwidth (0.05)

# KDE for a selected sector (PKD7): A (agriculture) or F (construction)
# Change the sector for the proper one (A or F) – Fig.50a & Fig.50b
firms.sect.ppp<-ppp(x=firms$lon[firms$SEK_PKD7=="F"], y=firms$lat[firms$SEK_PKD7=="F"], window=obs_window)
firms.sect.kde<-density(firms.sect.ppp, edge=T, kernel="gaussian", sigma=0.05) # KDE for selected sector

plot(st_geometry(region), border="grey60")
plot(firms.sect.kde, col=viri_cols, main="Spatial distribution of selected sector (F)", add=TRUE) 
plot(st_geometry(region), add=TRUE, border="grey60")

Figure 50: Spatial KDE (Kernel Density Estimation) for different point patterns (sectors) and the same bandwidth: a) agriculture, b) construction (F)

# Tab.14: KDE output (continued from code for Fig.50)

firms.sect.kde$v            # matrix of KDE values 128 * 128
firms.sect.kde$xcol     # x coordinates
firms.sect.kde$yrow     # y coordinates

Table 14: Typical raw output from KDE in the form of a) matrix and b) x and y coordinates

3.4.2 K-means clustering of KDE

K-means clustering can be applied to a number of variables as well as to a single variable. It always finds groups of similar values – in a ‘single-criterion’ or ‘multi-criterion’ ranking. The user specifies the number of clusters – based on some metric and/or arbitrarily. The sizes of these clusters and the values of the centroids (the cluster means) are the results of the clustering. For a single variable, k-means clustering is a data-driven division into intervals – the intervals are not set subjectively (e.g. by the user) but follow the empirical distribution of the data. In the case of the KDE object (in Fig.51, the outcome of the spatstat:: density() command) one clusters the ‘z’ dimension, i.e. the values of the kernel function stored in a vector. It should be noted that the cluster ID labels (e.g. cluster 1, cluster 2, …) are randomly assigned – there is no sorting mechanism that assigns labels according to some rule.
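A toy example (not from the book code) shows this behaviour on a single variable – the centres follow the data, while the cluster IDs carry no order:

set.seed(1)                       # for reproducibility
z<-c(rnorm(50, 0), rnorm(50, 10)) # two clearly separated groups
km<-stats::kmeans(z, centers=2)   # k-means on a single variable
km$centers  # interval centres near 0 and 10 (order is random)
km$size     # sizes of the clusters (here 50 and 50)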

# Fig.51: Clustered KDE for different sectors (continued from code for Fig.50)

# ppp object for spatstat:: for selected sector (choose A, H, S, U)
firms.sector.ppp<-ppp(x=firms$lon[firms$SEK_PKD7=="A"], y=firms$lat[firms$SEK_PKD7=="A"], window=obs_window, check=T)

firms.sector.kde<-density(firms.sector.ppp, edge=T, kernel="gaussian", sigma=0.05) # KDE in spatstat::

# k-means clustering into 10 clusters
km.sector<-stats::kmeans(as.vector(firms.sector.kde$v), 10)
km.sector$size  # number of observations in each cluster

# image plot – cluster IDs instead of KDE values
# clustering vector is formatted as matrix by rows
library(fields)
library(viridis)
viri_cols<-rev(viridis(n=10, option="F")) # colours in reverse order
par(mar=c(2,2,2,2)) # wider margins
image.plot(firms.sector.kde$xcol, firms.sector.kde$yrow, matrix(km.sector$cluster, ncol=128, nrow=128, byrow=TRUE), main="Clustering k=10 of KDE for selected sector (A)", col=viri_cols)# from fields::

Figure 51: K-means clustered KDE for different spatial patterns (sectors): a) sector A (Agriculture, forestry, hunting and fishing), b) sector H (Transportation and storage), c) sector S (Other service activities), d) sector U (Organisations and extraterritorial teams)

3.4.3 Entropy of clustered KDE

The previous code (Fig.51) clustered normalised KDE values. This code goes one step further and applies entropy to the k-means clusters. It checks the distribution of the KDE values across a predefined number of clusters – it can be equal (10% in each cluster in the case of 10 groups) or diversified (e.g. 80% in one cluster, 10% in another, and the remaining 10% spread over nearly empty clusters). Entropy is a measure of the (in)equality of this structural distribution. In the example for Fig.52, the entropy calculations are repeated for each sector (from A to U) (automated by a for() loop) and plotted.
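A two-line toy check (not part of the book code) shows both extremes of the relative entropy of cluster shares:

p.equal<-rep(0.1, 10)               # equal: 10% in each cluster
-sum(p.equal*log(p.equal))/log(10)  # relative entropy = 1
p.skew<-c(0.9, 0.05, 0.05)          # strongly concentrated shares
-sum(p.skew*log(p.skew))/log(3)     # relative entropy well below 1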

# Fig.52: Relative entropy of clustered KDE for sectoral point patterns (continued from code for Fig.49 and 50)

sectors<-c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U")

entropy<-matrix(0, nrow=21, ncol=1) # object for the loop

for(i in 1:21){     # start of the loop
sec<-sectors[i]     # select sector name

# spatstat:: object
firms.ppp<-spatstat::ppp(x=firms$lon[firms$SEK_PKD7==sec], y=firms$lat[firms$SEK_PKD7==sec], window=obs_window, check=T)
assign(paste0("firms.ppp.",sec), firms.ppp) # saving an object

# KDE in spatstat::
firms.kde<-density(firms.ppp, edge=T, kernel="gaussian", sigma=0.05)
assign(paste0("firms.kde.",sec), firms.kde) # save object

#fields::image.plot(firms.kde$xcol, firms.kde$yrow, firms.kde$v, main=paste0("Spatial distribution of sector ", sec)) # plot KDE

# k-means clustering of KDE values
km<-kmeans(as.vector(firms.kde$v), 10)
assign(paste0("km.", sec), km) # saving an object

image.plot(firms.kde$xcol, firms.kde$yrow, matrix(km$cluster, ncol=128, nrow=128, byrow=TRUE), main=paste0("Clustering k=10 of KDE for sector ", sec)) # plotting clustered KDE
savePlot(filename=paste0("kmeans_sect_", sec), type="jpg")

# calculating entropy and saving data to object
vec<-km$size/sum(km$size)   # relative sizes of the 10 clusters
ee<-(-1)*sum(vec*log(vec))  # nominal (Shannon) entropy
ee.rel<-ee/log(length(vec)) # relative entropy
entropy[i,1]<-ee.rel        # save relative entropy to loop object
}                   # end of the loop

# plot calculated entropy for each sector
par(mar=c(2,2,2,2))
plot(entropy[,1], type="l", sub="Sectors A:U", main="Entropy of spatial distributions by sectors", ylim=c(0, 0.8), ylab="entropy", xlab="") # line plot
abline(h=(0:10)/10, lty=3, col="grey80")    # horizontal lines
points(entropy[,1], pch=21, bg="red")       # points 
text(1:21, entropy[,1]+0.05, sectors)   # text next to points

Figure 52: Relative entropy of clustered KDE for sectoral point patterns

3.4.4 Rand Index for clustered KDE

This code presents the Rand Index for the k-means clustered KDE values illustrated in Fig.51 (and does not refer to the entropy analysis from Fig.52). Clustered data for an image are reported as a vector of natural numbers from 1 to 10 (in the case of k=10 clusters), stored for each pixel of the image. Cluster IDs are randomly assigned – for example, for two similar distributions, the empty areas outside the contour may be assigned different cluster IDs in the two settings. Therefore, the similarity check uses the Rand Index, which compares the co-assignment of pairs of pixels rather than the cluster IDs directly. The Rand Index in this approach is calculated using the clv:: package. For 21 sectors, one gets 441 pairs of sectors to examine, while the results are stored in a 21*21 matrix (the diagonal compares the same sectors, so the Rand Index (RI) is equal to 1). The matrix can be plotted using plot.matrix(). Diagonal values of RI=1 have been changed to RI=NA (RI for the same vectors) in order to obtain the real maximum RI value for two different vectors – and thus to better distribute colours over values. The overall summary of RI values can be presented with a (statistical) density plot.
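A toy check with clv:: (not part of the book code) confirms that the Rand Index is insensitive to such label switching:

library(clv)                   # for std.ext() and clv.Rand()
c1<-as.integer(c(1,1,2,2,3,3)) # clustering A
c2<-as.integer(c(3,3,1,1,2,2)) # same partition, permuted labels
clv.Rand(std.ext(c1, c2))      # equals 1 - identical partitions
c3<-as.integer(c(1,2,1,2,3,3)) # a different partition
clv.Rand(std.ext(c1, c3))      # below 1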

One technical aspect of storing and referencing objects should be noted. In a previous code (for Fig.52), k-means output objects were given sector-specific names using the assign() function. In this code (Fig.53), the get() command is used to retrieve these objects by their names – the text label points back to the stored content. The paste0() function creates the labels in the same way as in the code for Fig.52 – it merges text without any spaces.

# Fig.53 & Tab.15: Rand Index for k-means clustered sectoral KDE (continued from code for Fig.49, Fig.50 and Fig.51)

library(clv)        # for Rand Index
library(plot.matrix)    # to plot the matrix

rand<-matrix(0, nrow=21, ncol=21)   # object for a loop
colnames(rand)<-sectors # sector names assigned as column names
rownames(rand)<-sectors # sector names assigned as row names

# Rand Index for pairs of k-means clustering vectors (of KDE values)
for(i in 1:21){ # for each sector   # start of the first loop
for(j in 1:21){ # for each sector   # start of the second loop
A<-get(paste0("km.", sectors[i]))   # sectors from previous code
B<-get(paste0("km.", sectors[j]))   # sectors from previous code
external.ind<-std.ext(A$cluster, B$cluster) # object for clv::
rand[i,j]<-clv.Rand(external.ind)       # Rand Index 
}}                      # end of the double loop

rand          # Tab.15

diag(rand)<-NA  # change diagonal values from 1 to NA 
plot(rand)      # better plotting with a real maximum of Rand index

library(RColorBrewer)       # for colour palette
p6<-brewer.pal(n=6, name="YlGn")    # 6 colours, yellow to green
par(mar=c(3,3,3,4))     # wide margins
plot(rand, col=p6, na.col="white", main="Rand Index between sectors", xlab="", ylab="", cex.axis=0.7) # coloured matrix plot

# Fig.53b – statistical density plot of Rand index values
plot(density(rand, na.rm=TRUE), main="Density of Rand index between sectors")
abline(v=(0:10)/10, lty=3, col="grey80") # vertical lines

round(rand,2)   # Tab.15, rounded RI values

Table 15: Rand Index for sectoral comparisons

Figure 53: Rand Index for partitioning of sectoral distributions (based on k-means clustering of sectoral KDE)

3.4.5 Clustering of spatial series of KDE vectors

This part of the code is for clustering ‘spatial series’. It uses the same objects as created in the code for Fig.52. That previous code calculated a KDE for each sector in a for() loop and stored it as an object using the assign() function. This code retrieves these objects with the get() command. In both cases, Fig.52 and Fig.55, the names of the sectoral objects are built with the paste0() command, which merges the fixed part ‘firms.kde.’ and the variable part – the letter name of the sector. This code extracts the KDE values as vectors (with the as.vector() command), normalises them and stores each under a sectoral name (with the assign() command). These sectoral vectors become a ‘spatial series’ – 128 * 128 = 16,384-element vectors that have lost the matrix formatting (KDE data are row-wise values from the top-left corner to the bottom-right corner). Spatial series are plotted as line plots; thanks to the normalisation procedure they are comparable.

# Fig.55: Plotting spatial series (vectorised KDE) for each sector (continued from code for Fig.49, Fig.50, Fig.51 and Fig.52)

for(i in 1:21){             # start of the loop, for each sector
v<-get(paste0("firms.kde.", sectors[i])) # call a KDE object
vv<-as.vector(v$v)          # use the KDE values
vv<-(vv-min(vv))/sum(vv-min(vv)) # normalisation procedure
assign(paste0("v", sectors[i]), vv) # store the KDE vector
}                       # end of the loop

# Plot the spatial series together
par(mfrow=c(7,3))   # Divide the window into 7 rows and 3 columns
par(mar=c(1,1,1,1)) # narrow margins
for(i in 1:21){     # start of the loop, for each sector
plot(get(paste0("v", sectors[i])), type="l", ylab=paste0("v", sectors[i]))      # plot of spatial series for each sector
}               # end of the loop
par(mfrow=c(1,1))   # back to 1*1 division of window

Figure 55: Spatial series (vectorised KDE) for 21 sectors (from A to U)

In the second part of the code (Fig.56), these ‘spatial series’ are hierarchically clustered as ‘time series’ using TSclust:: package. The algorithm looks for similar behaviour of the series. Due to the fixed division of space into pixels, a given observation (e.g. the 100th observation in each series) reflects the same location. As the KDE values have been normalised, the size of the phenomena (number of firms in a sector) does not matter. The code uses the diss() function from the TSclust:: package to compute Euclidean distances between these vectors and the basic hclust() command to prepare the hierarchical clustering. Data required processing such as renaming and rearranging. Plotting required saving graphic objects to memory (possible in ggplot() graphics) and plotting together using grid.arrange() from the gridExtra:: package. Long, composite graphs are finally saved as a ‘grob’ (‘description of graphical item’) using the ggplotGrob() command.

# Fig.56: Hierarchical clustering of spatial series

# Common data.frame object with spatial series 
# 21 columns, 128*128=16384 rows
ts.df<-as.data.frame(cbind(vA, vB, vC, vD, vE, vF, vG, vH, vI, vJ, vK, vL, vM, vN, vO, vP, vQ, vR, vS, vT, vU))
install.packages("TSclust")
install.packages("ggdendro")

library(TSclust)        # for clustering of time / spatial series
library(ggdendro)   # for dendrogram in ggplot() version
library(tidyverse)  # for pipe operations

ts.dist<-TSclust::diss(SERIES=t(ts.df), METHOD="EUCL")
hc<-stats::hclust(ts.dist, method="complete") # hierarchical cluster

# Prepare well-formatted clusters
hclus<-stats::cutree(hc, k=4) %>%   # cut into k=4 clusters
  as.data.frame(.) %>%          # format as data.frame
  dplyr::rename(.,cluster_group = .) %>%    # change names
  tibble::rownames_to_column("type_col")    # reshape data

hcdata<-ggdendro::dendro_data(hc)   # reshape data
names_order<-hcdata$labels$label        # create labels

# Dendrogram of spatial series with ggplot::
p1 <- hcdata %>%
  ggdendro::ggdendrogram(., rotate=TRUE, leaf_labels=FALSE)

# multiplot of spatial series to fit dendrogram
p2 <- ts.df %>%
  dplyr::mutate(index = 1:16384) %>%
  tidyr::gather(key=type_col, value=value, -index) %>%
  dplyr::full_join(., hclus, by="type_col") %>% 
  mutate(type_col=factor(type_col, levels= rev(as.character(names_order)))) %>% 
  ggplot(aes(x=index, y=value, colour=cluster_group)) +
  geom_line() +
  facet_wrap(~type_col, ncol=1, strip.position="left") + 
  guides(color=FALSE) +
  theme_bw() + 
  theme(strip.background=element_blank(), strip.text=element_blank())

gp1<-ggplotGrob(p1)
gp2<-ggplotGrob(p2) 

library(gridExtra)
grid.arrange(gp2, gp1, ncol=2, widths=c(4,2))

Figure 56: Spatial series clustering applied to vectorized KDE outputs

3.5 Spatial Relative Risk - comparisons of density of two groups (test and control)

This code presents the maps of spatial relative risk – for two variables, cases (firms) and controls (population). It uses ppp class objects from spatstat:: as input to the relrisk() function which estimates the relative risk. Spatstat:: results can be converted to the data.frame class and further to the sf class and plotted (Fig.58). However, this is not a big data solution – it can only handle a few thousand observations.

# Fig.57: Spatial relative risk for firms data related to population
# *** are lines that need to be changed in the next figure (Fig.58)

# Convert regional contour to planar (Mercator) projection (in sf::)
# and then to owin class from spatstat::
region.merc<-sf::st_transform(sf::st_as_sf(region), crs=sf::st_crs(3857))
region.owin<-spatstat.geom::as.owin(sf::st_geometry(region.merc))

# Prepare point data to analyse firms related to the population
# ‘cases’ are firms (in the numerator) 
# ‘control’ is population (in the denominator)
nn<-5000        # ***, number of observations to consider
selector.popul<-sample(1:dim(popul)[1], nn, replace=FALSE)
popul.lim<-popul[selector.popul, 1:2] # xy columns and random rows
popul.lim$marks<-rep("control", times=nn) # marks of observations

selector.firms<-sample(1:dim(firms)[1], nn, replace=FALSE) #***
firms.lim<-firms[selector.firms, crds.col]              #***
colnames(firms.lim)<-c("x", "y")
firms.lim$marks<-rep("case", times=nn) # tags of observations

# Bind firms and popul data into single object and format it
all.lim<-rbind(firms.lim, popul.lim)    # Bind records
all.lim.sf<-st_as_sf(all.lim, coords=c("x", "y"), crs=4326)
all.lim.sf2<-sf::st_transform(sf::st_as_sf(all.lim.sf), crs=sf::st_crs(3857))   # Convert points projections to Mercator
all.lim.sf2     # Display the structure of the dataset
library(spatstat)
all.ppp<-ppp(x=st_coordinates(all.lim.sf2)[,1], y=st_coordinates(all.lim.sf2)[,2],window=region.owin, marks=as.factor(all.lim.sf2$marks))
all.ppp
# spatial relative risk
srr<-relrisk(all.ppp, se=TRUE)      # from spatstat::
plot(srr$estimate, main="  ")       # *** plot estimates of srr
#plot(srr$SE, main="  ")            # plot SE of srr estimates

# Fig.58 – the same as Fig.57, but for a limited dataset by sector
# Use the following lines instead of those labeled *** in Fig.57
nn<-5000        # number of observations to consider

firms.lim<-firms[firms$SEK_PKD7=="M",] # select a sector
selector.firms<-sample(1:dim(firms.lim)[1], nn, replace=FALSE)
firms.lim<-firms.lim[selector.firms, crds.col]
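# then re-apply the colnames and marks steps and rebuild all.lim,
# all.lim.sf, all.lim.sf2 and all.ppp (as above) before relrisk()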

# spatial relative risk
srr<-relrisk(all.ppp, se=TRUE)      # from spatstat::
plot(srr$estimate, main="Spatial relative risk for sector M") # ***

Figure 57: Spatial relative risk for firms related to population; a) for 500 points, b) for 5000 points

The figures below require only minor changes to the existing code, as indicated by ***.

Figure 58: Relative risk of sectoral location compared to population: a) sector A (agriculture); b) sector M (professional, scientific and technical activities)

3.6 Spatial Benford distribution and natural spatial pattern for 3D data

This set of R codes works with Benford’s law. In the first part (Fig.59), it calculates the probability of consecutive first digits (from 1 to 9) and first two digits (from 10 to 99) of numbers in naturally created data sets according to the Benford distribution and plots it.
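The probabilities have a closed form, P(d) = log10(1 + 1/d) for a first digit d; a one-line check (equivalent to the p.these.digits() calls below) reproduces them:

round(log10(1 + 1/(1:9)), 4) # Benford probabilities for digits 1:9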

# Fig.59: Frequencies in Benford distribution

install.packages("benford.analysis")
library(benford.analysis)   # for Benford analysis

vec<-1:9                    # Fig.59a, for digits 1:9
benf<-matrix(0, nrow=9, ncol=1) # empty object
for(i in 1:9){              # loop for each digit
benf[i,1]<-p.these.digits(vec[i])}  # probabilities of each digit
benf                        # display the results

plot(1:9, benf[,1], type="l", ylim=c(0,0.35), axes=FALSE, xlab="", ylab="") # line plot
points(1:9, benf[,1], pch=21, bg="darkgreen", cex=2) # add points
title(main="Probability of given first digit 
in Benford distribution from 1 to 9") # title
text(1:9, benf[,1]+0.02, round(benf[,1],2)) # text next to points
axis(1, at=1:9) # x axis
axis(2)     # y axis
abline(h=(0:10)*0.05, lty=3, col="grey80") # horizontal lines

vec<-10:99                  # Fig.59b, for numbers 10:99
benf<-matrix(0, nrow=90, ncol=1)    # empty object
for(i in 1:90){     # for numbers from 10 to 99
benf[i,1]<-p.these.digits(vec[i])}  # probabilities of each digit
benf                        # display the results

plot(10:99, benf[,1], type="l", ylim=c(0,0.05), axes=FALSE, xlab="", ylab="")   # line plot
points(10:99, benf[,1], pch=21, bg="darkgreen", cex=2) # add points
title(main="Probability of given first two digits 
in Benford distribution – from 10 to 99") # title
axis(1, at=10:99)   # x axis
axis(2)         # y axis
abline(h=(0:10)*0.01, lty=3, col="grey80")  # horizontal lines

Figure 59: Benford distribution of first digit (a) and first two digits (b)

In the second part (Fig.60), it generates a mixture of spatial point patterns conforming to Benford’s law using st_sample(), calculates the distances between points using dist() and checks the Benford’s law conformity of these mutual distances using benford() from the benford.analysis:: package. All calculations are done in the Mercator (planar) coordinate system – this is to have the data ready for clarkevans.test() from spatstat::, which is used to assess the nature of the point patterns.

Benford’s law is scale independent. Distances (of the particular composition of the data) in any coordinate systems, projection and unit (km, miles, etc.) should obey this law. For plotting one can choose colours which were pre-selected from the viridis:: package. This analysis is then automated using the SpatBenfordPattern() command from the spatialWarsaw:: package.
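A toy check of this scale independence (not part of the book code; lognormal data with a large variance are close to the Benford distribution):

d<-exp(rnorm(1000, 0, 4))     # lognormal data - close to Benford
b1<-benford.analysis::benford(d)          # e.g. in kilometres
b2<-benford.analysis::benford(d*0.62137)  # the same in miles
c(b1$MAD.conformity, b2$MAD.conformity)   # similar conformity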

# Fig.60: Conformity with Benford’s law: a) Benford-like mixed point pattern, b) distribution of first two digits for mutual distances of the Benford-like mixed point pattern

# Set colours for graphics
library(viridis)            # for colour palette
library(scales)         # for colour palette
library(RColorBrewer)       # for colour palette

colA<-viridis(15, option="A") # from viridis:: and scales::
colB<-viridis(15, option="B") 
colC<-viridis(15, option="C") 
colD<-viridis(15, option="D")
colE<-viridis(15, option="E") 
colALL<-cbind(colA, colB, colC, colD, colE)
show_col(colALL, ncol=15) # colour palettes to choose colours

# Select colours for all plots
# Colours: ordered / random / clustered centered / clustered skewed
cols<-c(colB[9], colC[13], colD[6], colD[9]) 
show_col(cols)

# convert polygon to Mercator coordinate system in class sf
poly.sf.merc<-sf::st_transform(poly.sf, crs=3857) # Mercator proj.
plot(st_geometry(poly.sf.merc))

# generate point-patterns
# number of obs. clustered/regular/random point-patterns
nn<-c(810, 120, 70)     # proportions must be fixed

x<-rnorm(nn[1],     3e+04, 15000)       # clustered skewed pattern
y<-rnorm(nn[1], 8e+04, 15000)
bb1<-data.frame(X=x, Y=y)

b2<-st_sample(poly.sf.merc, nn[2], type="regular")  # regular
bb2<-as.data.frame(st_coordinates(b2))

b3<-st_sample(poly.sf.merc, nn[3], type="random")   # random
bb3<-as.data.frame(st_coordinates(b3))

bb4<-rbind(bb1, bb2, bb3)       # all points in one object
bb4

# scatterplot of clustered point-pattern – Fig.60a
plot(bb2, pch=21, bg=alpha(cols[1], 0.7), col=alpha(cols[1],0.7)) # ordered
points(bb3, pch=21, bg=cols[2], col=cols[2]) # random
points(bb1, pch=21, cex=0.99, bg=cols[3], col=cols[3]) # clustered
legend("topright", legend=c("ordered", "random", "clustered"), fill=cols, cex=1.2, bty="n")
title(main="Composite point-pattern
with shifted core of clustered distribution ")
# Matrix of mutual distances between points – input of Benford dist.
dist.bb4<-dist(bb4)             # distances between all (X,Y) points
dist.bb4.m<-as.matrix(dist.bb4)         # distances as matrix
dist.bb4.v<-as.vector(dist.bb4.m)   # distances as vector

# Benford test 
bb<-benford.analysis::benford(dist.bb4.v)
bb
plot(bb)

# Clark-Evans test – to check the type of point pattern
library(spatstat)
poly.owin<-spatstat.geom::as.owin(sf::st_geometry(poly.sf.merc))
points.ppp<-ppp(x=bb4[,1], y=bb4[,2], window=poly.owin)
cc<-clarkevans.test(points.ppp) # Clark-Evans test for all data
cc

# barplot of Benford conformity – Fig.60b
tt<-as.data.frame(bb$bfd)
barplot(tt$data.second.order.dist.freq~tt$digits, main="Benford Second-Order Test – Composite pattern
all digits, 1.000 points, 500.000 mutual distances", xlab="digits", ylab="Second-order frequency distribution")
lines(tt$benford.so.dist.freq, lty=3, col="red", lwd=3)
text(25, 15000, labels=paste("R in Clark-Evans test:  ", round(cc$statistic,3)), pos=4, font=2)
text(25, 14000, labels=paste("p-value in Clark-Evans test:  ", round(cc$p.value,3)), pos=4, font=2)
text(25, 13000, labels=paste("Benford test:  ", bb$MAD.conformity), pos=4, font=2)
# automated solution to generate Benford-like pattern
library(spatialWarsaw)
sbp<-SpatBenfordPattern(region, 5000) # contour and number of points
head(sbp,10)    # crds of points and their tag (original pattern)

Figure 60: Conformity with Benford’s law: a) Benford-like mixed point pattern, b) distribution of first two digits for the mutual distances of the Benford-like mixed point pattern

In the third part (Fig.61), it is checked whether the mutual distances between the firms of selected sectors conform to Benford’s law, i.e. whether the point pattern is a natural one. It calculates the distances using the st_distance() command from the sf:: package and tests them with the benford() command from the benford.analysis:: package. The same is done in an automated version with SpatBenfordTest() from the spatialWarsaw:: package.

# Fig.61: Benford’s law conformity in the sectoral distribution of firms: a) agriculture (sector A),  b) public administration and security (sector O)

# Benford conformity – distribution of firms in a given sector
# Select sector A or O
sub<-firms[firms$SEK_PKD7=="A", 9:10]   # change the sector here
selected<-sample(1:dim(sub)[1], 1000, replace=FALSE)
sub1<-sub[selected, ] # subsample of firms
pts.sf<-st_as_sf(sub1, coords=c("lon", "lat"), crs=4326)
dist.sf<-sf::st_distance(pts.sf) # distance matrix (in metres)
bb.dist<-benford.analysis::benford(as.vector(dist.sf)) 

# Alternative solution using spatialWarsaw::
# it inputs coordinates and outputs Benford test
bb.dist1<-spatialWarsaw::SpatBenfordTest(sub1, 1000)

# Plot the Benford distribution
tt<-as.data.frame(bb.dist$bfd)
par(mar=c(4,4,4,4))
barplot(tt$data.second.order.dist.freq~tt$digits, main="Benford Second-Order Test 
for a sample of 1000 sector locations", xlab="digits", ylab="Second-order frequency distribution")
lines(tt$benford.so.dist.freq, lty=3, col="red", lwd=3)
text(25, 14000, labels=paste("Benford test:  ", bb.dist$MAD.conformity), pos=4, font=2)

Figure 61: Benford’s law (non)conformity in sectoral distributions of firms: a) agriculture (sector A), b) public administration and security (sector O)

3.7 Directional density with rose diagram

This code presents how to calculate angles between individual points and a Point of Interest (POI) and summarise them with a rose diagram to obtain the directional density of point locations. The codes for Fig.63 and Fig.64 start with the same data preparation and angle counting. In Fig.63 the subsample is very small in order to plot the points labelled with their angles – this is crucial to understand where a given angle is located and to set the rose diagram correctly. Note that the angles were calculated using the atan2() command, which outputs the arc tangent in radians – they are converted to degrees by multiplying by 180 and dividing by pi. However, this output runs from 0° to 180° and from 0° to -180°, while rose() from the spatstat:: package works on a scale from 0° to 360°. Therefore there is another angle transformation to this scale – by adding 360° to the negative values. Fig.63 allows an almost manual check that the degree scale is continuous and correctly positioned. The rose() command must start in the east (option start="E") and rotate counter-clockwise (clockwise=FALSE).
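
The same conversion can also be written in a vectorised form; a tiny sketch with hypothetical offsets from the POI:

dx<-c(1,-1,-1,1); dy<-c(1,1,-1,-1) # hypothetical offsets from the POI
ang<-atan2(dy, dx)*180/pi          # degrees in (-180, 180]
ifelse(ang<0, ang+360, ang)        # rescaled to [0, 360): 45 135 225 315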

# Data preparation and calculations for Fig.63 and Fig.64
nn<-100     # Fig.63 – few observations to see details well
nn<-10000   # Fig.64 – many observations; run only one of these two lines

selected<-sample(1:dim(firms)[1], size=nn, replace=FALSE)
firms.sel<-firms[selected, ]
firms.sel<-firms.sel[firms.sel$dist_core_25==1, 9:10]
colnames(firms.sel)<-c("x", "y")
center<-c(21.016, 52.254)   # city center, reference point POI
center.df<-data.frame(x=21.016, y=52.254) # the same as data.frame

nn2<-dim(firms.sel)[1]  # actual number of observations
angles<-matrix(0, nrow=nn2, ncol=1) # output depot
for(i in 1:nn2){                    # for() loop
angles[i]<-atan2(firms.sel[i,2]-center[2], firms.sel[i,1]-center[1])*180/pi}            # angles between firms and center
angles2<-as.data.frame(angles)  # convert to data.frame
minus<-which(angles2$V1<0)      # selecting negative angles
angles2$V2<-angles2$V1
angles2$V2[minus]<-360+angles2$V1[minus] # convert to 0-360° scale
# Fig.63a - plot of locations using angles as labels 
plot(firms.sel, pch=21, bg="black", cex=0.6)    # plot of points
points(center[1], center[2], pch=21, cex=3, bg="red")# POI
text(firms.sel[,1], firms.sel[,2]+0.005, round(angles2$V2,2), cex=0.7)  # text next to points
abline(v=center[1], lty=3)      # vertical line
abline(h=center[2], lty=3)      # horizontal line
legend("topright", "0°-90°", bty="n", text.font=4)
legend("topleft", "90°-180°", bty="n", text.font=4)
legend("bottomleft", "180°-270°", bty="n", text.font=4)
legend("bottomright", "270°-360°", bty="n", text.font=4)
text(max(firms.sel$x), center[2], "0°", font=4)
text(center[1], max(firms.sel$y), "90°", font=4)
text(min(firms.sel$x), center[2], "180°", font=4)
text(center[1], min(firms.sel$y), "270°", font=4)
title(main="Angles between origin and points, in degrees")
# Fig.63b – Rose diagram with spatstat:: for directional density
library(spatstat)
rose(angles2$V2, col="lightpink2", start="E", clockwise=FALSE, main="Rose diagram - directional density") # with spatstat::

Figure 63: Directional density: a) angle of location from the Point of Interest (in red in the centre); b) rose diagram for angular data

In Fig.64 an additional layer of rivers taken from OpenStreetMap (OSM) has been added. This shows the reason for the discontinuity in the location of companies. This was done using the osmdata:: package, which uses pipe operations (%>%). The full list of possible features that can be extracted from OSM can be found at: https://wiki.openstreetmap.org/wiki/Map_features. It can also be accessed with available_features() (called with no argument) and available_tags() (called with the selected feature as the argument, e.g. available_tags("highway")). The figure presented in the book was plotted using the plot() function. The alternative ggplot() graphics requires setting the x and y boundaries of the graph, which were taken from the administrative region bounding box obtained with st_bbox(). The rose diagram uses the same syntax as in Fig.63.
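
For example, the feature keys and their values can be inspected directly (a quick sketch; the output depends on the current OSM feature list):

library(osmdata)
head(available_features())  # all feature keys, e.g. "waterway"
available_tags("waterway")  # possible values for a key, e.g. "river"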

# Fig.64: Illustration of analysed points in Warsaw and surroundings
# download rivers in the city of Warsaw from OpenStreetMap
# Contour of Warsaw (Warszawa) city was prepared in ‘starter’
library(osmdata)
getbb("Warszawa") %>%       # get the rivers as lines
  opq() %>%
  add_osm_feature(key="waterway", value="river") %>%
  osmdata_sf()->river
# Fig.64a
library(sp)           # for degAxis()
plot(st_geometry(Warszawa), lwd=2, col="grey85")    # contour of city
points(firms.sel, pch=21, bg="black", cex=0.6)  # firms as points
points(center[1], center[2], pch=21, cex=3, bg="red")   # central point
plot(st_geometry(river$osm_lines), add=TRUE, col="steelblue1", lwd=3)   # rivers
degAxis(1)  # geographic axis, longitude, with sp::
degAxis(2)  # geographic axis, latitude
title(main="Firms located in Warsaw and surrounding") # main title
# Fig.64a - the same illustration in ggplot()
bb<-st_bbox(Warszawa)       # bounding box of Warsaw
ggplot() + geom_sf(data=Warszawa) + geom_point(data=firms.sel, aes(x,y), color="grey99") + geom_sf(data=river$osm_lines, inherit.aes=FALSE, color="deepskyblue", size=1.8, alpha=0.8) + xlim(bb$xmin, bb$xmax) + ylim(bb$ymin, bb$ymax) + geom_point(data=center.df, aes(x,y), color="red", cex=3)
# Fig.64b - Directional density with rose diagram (with spatstat::)
rose(angles2$V2, col="lightpink2", start="E", clockwise=FALSE, main="Rose diagram - directional density
for firms located in Warsaw")
# Fig.64c for population
nn<-10000
selected<-sample(1:dim(popul)[1], size=nn, replace=FALSE)
popul.sel<-popul[selected, 1:2]

nn2<-dim(popul.sel)[1]
angles<-matrix(0, nrow=nn2, ncol=1)
for(i in 1:nn2){
angles[i]<-atan2(popul.sel[i,2]-center[2], popul.sel[i,1]-center[1])*180/pi}
angles2<-as.data.frame(angles)
minus<-which(angles2$V1<0)
angles2$V2<-angles2$V1
angles2$V2[minus]<-360+angles2$V1[minus]

# Fig.64c
plot(st_geometry(Warszawa), lwd=2, col="grey85")
points(popul.sel, pch=21, bg="black", cex=0.6)
points(center[1], center[2], pch=21, cex=3, bg="red")
plot(st_geometry(river$osm_lines), add=TRUE, col="steelblue1", lwd=3)
degAxis(1)      # x axis
degAxis(2)      # y axis
title(main="Population located in Warsaw and surrounding")
# Fig.64d
rose(angles2$V2, col="palegreen1", start="E", clockwise=FALSE, main="Rose diagram - directional density
for population located in Warsaw")

Figure 64: Firms and population locations in Warsaw as angle to city centre (red point): a) location of firms as point data; b) rose diagram illustrating directional density of firms; c) location of inhabitants as point data; d) directional density of population

Codes for Chapter 4

4.1 Spatial density data in econometric models

This code maps the local density of businesses: the number of service businesses within a close radius (500 m) of each business (Fig.66a) and a dummy variable indicating whether a given business was classified as a high-density business by the DBSCAN algorithm (Fig.66b). Two plotting schemes were used: plot() with findInterval() in Fig.66a and ggplot() in Fig.66b.
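
The findInterval() mechanism used for colouring assigns each value to an interval between consecutive break points; a toy illustration with made-up numbers:

brks<-c(0, 1, 5, 10)                 # hypothetical thresholds
findInterval(c(0.5, 2, 7, 12), brks) # interval index per value: 1 2 3 4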

# Fig.66a – local density of service firms (in radius of 500 m)
variable<-sub.firms$locAggServ  # variable to plot
cols<-rainbow(10)           # colours
stat<-summary(variable)     # summary of data
brks<-c(stat[1], stat[2], stat[3], stat[5], stat[6])    # thresholds

par(mar=c(2,2,3,3))         # margins in the figure
plot(st_geometry(region), border="grey50")      # contours of region
points(sub.firms[,9:10], col=cols[findInterval(variable, brks)], pch=16)    # points in colours, 9:10 are columns with xy coordinates 
legend("topright", legend=round(brks,3), fill=cols, cex=0.8, bty="n")                   # legend
title(main=paste0("Local density of points 
(number of service firms around, in 500 m), by quartiles"), sub="In legend, intervals from …")      # title of the figure

#Fig.66b – firms classified into high-density DBSCAN clusters
ggplot()+ geom_sf(data=sub.firms.sf, aes(col=COREfirms))

Figure 66: Local density of points: a) number of service firms around business entities (in a radius of 500 m); b) firms classified as belonging to a high-density DBSCAN clusters

4.2 K-nearest neighbours and Voronoi spatial weights matrix for point data

The first part (Fig.69) is an automated code using corrSpatialLags() from the spatialWarsaw:: package to present the theoretical and empirical correlations between spatial lags built with knn and knn+i neighbours.
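
Conceptually, the command correlates spatial lags of the same variable computed under weight matrices with different knn; a minimal manual sketch of this idea (assuming the firms object and coordinate columns from the ‘starter’):

library(spdep)
sel<-sample(1:dim(firms)[1], 500, replace=FALSE) # subsample of points
crd<-as.matrix(firms[sel, 9:10])                 # xy coordinates
W5<-nb2listw(knn2nb(knearneigh(crd, k=5)))       # W with knn=5
W10<-nb2listw(knn2nb(knearneigh(crd, k=10)))     # W with knn=10
cor(lag.listw(W5, firms$locPdens[sel]), lag.listw(W10, firms$locPdens[sel]))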

# Fig.69: Theoretical correlations between spatial lags with different numbers of k nearest neighbours 

library(spatialWarsaw)
csl<-corrSpatialLags(firms.sf, "locPdens", 1000, knn=1:30)
csl # show correlation matrix for spatial lags with different knn

Figure 69: Theoretical and empirical correlations between spatial lags with different numbers of k nearest neighbours

In the second part (Fig.70 and Tab.17) it shows how to prepare a spatial weight matrix W based on Voronoi tessellation for point data. It runs the Voronoi tessellation in class sf (in a planar coordinate system) and makes polygons from the tessellation tiles – these become an input to build a contiguity matrix with the spdep:: package. It also builds the higher-order spatial weight matrices (neighbours of neighbours by contiguity) using the nblag() and nblag_cumul() commands from the spdep:: package. It plots the selected neighbourhoods with the plot() command. The Voronoi-based W is also computed automatically using tessW() from the spatialWarsaw:: package.

# Tab.17 & Fig.70: Tessellation-based contiguity spatial weight matrices

# Subset of data & planar (Mercator) coordinates
selector<-sample(1:dim(firms)[1], size=500, replace=FALSE)
firms.sf.sub<-firms.sf[selector, ]
firms.sf.sub.merc<-sf::st_transform(firms.sf.sub, crs=3857)
region.merc<-sf::st_transform(region, crs=3857)

# Voronoi tessellation in class sf
points.sfc<-st_geometry(firms.sf.sub.merc)
box.sfc<-st_geometry(region.merc)
points.union<-st_union(points.sfc)
tess<-st_voronoi(points.union, box.sfc)
tess.clip<-st_intersection(st_cast(tess), st_union(box.sfc))
crds.sf<-st_centroid(st_geometry(tess.clip)) # centroids of tiles

# spatial weight matrix W based on tessellation tiles
# 1st row contiguity 
library(spdep)
tess.nb<-poly2nb(tess.clip)             # class nb
tess.listw<-nb2listw(tess.nb, style="W")        # class listw
summary(tess.listw)                     # data for Tab.17

# Plot of contiguity W from tesselation
plot(st_geometry(tess.clip))                # plot tessellation
plot(tess.nb, crds.sf, add=TRUE)            # plot links

# Second row nb
nb2<-nblag(tess.nb, 2)
nb2cum<-nblag_cumul(nb2)
nb2cum                          # data for Tab.17

# Third row nb
nb3<-nblag(tess.nb, 3)
nb3cum<-nblag_cumul(nb3)
nb3cum                          # data for Tab.17

# Fourth row nb
nb4<-nblag(tess.nb, 4)
nb4cum<-nblag_cumul(nb4)
nb4cum                          # data for Tab.17

Table 17: Characteristics of tessellation-based spatial weight matrices W for different rows of neighbourhood (nb) for 500 points

# Fig.70
# 1st row contiguity – plot of selected neighbours
plot(st_geometry(tess.clip))    # plot tessellation tiles
plot(st_geometry(tess.clip)[unlist(tess.nb[34])], col="red", add=TRUE) # neighbours for observation no 34
plot(st_geometry(tess.clip)[34], col="darkred", add=TRUE)
title(main="Contiguity neighbours of 1st row", sub="spatial weights matrix from tessellation tiles (Voronoi polygons)")

# 4th row contiguity – plot of selected neighbours
plot(st_geometry(tess.clip)) # plot tessellation tiles
plot(st_geometry(tess.clip)[unlist(nb4cum[34])], col="red", add=TRUE) # selected neighbours in red
plot(st_geometry(tess.clip)[34], col="darkred", add=TRUE)
title(main="Contiguity neighbours of 4th row", sub="spatial weights matrix from tessellation tiles (Voronoi polygons)")
# automated solution for Voronoi-based W
library(spatialWarsaw)
subfirms.sf<-firms.sf[1:100,]
my.tess<-tessW(subfirms.sf, region, sample_size = nrow(subfirms.sf))
my.tess

Figure 70: Visualisation of the tessellation-based contiguity spatial weight matrices: a) for the first-row neighbourhood, b) for the fourth-row neighbourhood

In the third part (Fig.71), the code presents how to find the optimal number of nearest neighbours to include in the spatial weight matrix based on the knn criterion. It uses a spatial econometric model (SEM, Spatial Error Model) with a given equation. In the automated approach it uses the bestW() command from the spatialWarsaw:: package. The same solution is presented step by step: first, it ‘jitters’ the points (adds an epsilon to the x and y coordinates to avoid overlapping) – random values from a normal distribution rnorm() are used; second, in a for() loop, it creates the knn W with an increasing number of neighbours and estimates the spatial econometric model using errorsarlm() from the spatialreg:: package. Finally, it plots AIC and the spatial parameter (lambda) from the spatial models to find the optimal knn – it is at the point of minimum AIC.

# Fig.71: AIC optimised knn

# Equation for estimation based on point data
eq<-locAggTOTAL ~ locHightech + locBIG + locPdens +  COREfirms + dist_core_10 + dist_midsize_10 + dist_regional_10 + dist_localbig_10 + dist_localsmall_10 

# Optimal knn to construct spatial weight matrix W for points
# automatic solution with spatialWarsaw:: package
library(spatialWarsaw)
best.W<-spatialWarsaw::bestW(firms.sf, eq, model_type="SEM", 1000, knn = c(2,5,10,15,20)) # one can choose "SDM" and proper knn
best.W

# stepwise solution for the same equation
vec<-(1:10)*4           # vector of knn-s for analysis

# jittering of coordinates to avoid overlaps
selector<-sample(1:dim(firms)[1], size=2000, replace=TRUE)
sub<-firms[selector, ]          # subset of data
crds<-firms[selector, crds.col]     # xy coordinates
epsilon.x<-rnorm(dim(crds)[1], 0, sd(crds[,1])/1000) # (n,mean,sd)
epsilon.y<-rnorm(dim(crds)[1], 0, sd(crds[,2])/1000)
crds[,1]<-crds[,1]+epsilon.x    # new jittered x-coordinates
crds[,2]<-crds[,2]+epsilon.y    # new jittered y-coordinates

# objects for output
output<-matrix(0, nrow=length(vec), ncol=3)
colnames(output)<-c("knn", "AIC", "lambda")

# loop for each knn
install.packages("spatialreg")
library(spatialreg)
for(i in 1:length(vec)){            # here it is vec=(1:10)*4
knnW<-nb2listw(make.sym.nb(knn2nb(knearneigh(as.matrix(crds), k=vec[i]))))          # spatial weight matrix based on knn
model<-errorsarlm(eq, data=sub, knnW)       # spatial error model
output[i,1]<-vec[i]     # store number of knn
output[i,2]<-AIC(model) # save AIC as a quality metric
output[i,3]<-model$lambda}  # save spatial parameter lambda 

output[output==0]<-NA
output<-as.data.frame(output)

# Spatial weight matrix W with the best number of neighbours
best.line<-which(output$AIC==min(output$AIC, na.rm=TRUE))

bestW<-nb2listw(make.sym.nb(knn2nb(knearneigh(as.matrix(crds), k=vec[best.line]))))

# plot AIC as a function of knn
plot(output[,1:2], type="l", xlab="knn", ylab="AIC", lwd=2)
points(output[,1:2], pch=21, bg="lightblue", cex=1.5)
title("Minimising AIC by selecting knn")
abline(v=vec[best.line], lty=3, col="grey60")
text(vec[best.line], max(output$AIC, na.rm=TRUE), "best knn")
text(vec[best.line], max(output$AIC, na.rm=TRUE)-0.08*abs(max(output$AIC)-min(output$AIC)), output[best.line,1])

# plot of lambda as a function of knn
plot(output[,c(1,3)], type="l", xlab="knn", ylab="lambda", lwd=2)
points(output[,c(1,3)], pch=21, bg="coral2", cex=1.5)
title("Lambda for different knn")
abline(v=vec[best.line], lty=3, col="grey60")
text(vec[best.line], min(output$lambda, na.rm=TRUE), "best lambda")

# Comparison of models with best AIC W and tessellated W
bestW<-nb2listw(make.sym.nb(knn2nb(knearneigh(as.matrix(crds), k=32)))) # AIC-optimised W
model.aic<-errorsarlm(eq, data=sub, bestW)

crds.sf<-st_as_sf(crds, coords=c("lon", "lat"), crs=4326)
my.tess<-tessW(crds.sf, region, sample_size=nrow(crds.sf)) # tessW
model.tess<-errorsarlm(eq, data=sub, my.tess)

# Calculate distances to knn
library(dbscan)     # for kNNdist()
dd<-kNNdist(as.matrix(sub[,9:10]), k=32, all=TRUE) # with dbscan::
dd.max<-apply(dd,1, max)    # distance to knn=32 for each obs.

Figure 71: Optimal knn search results: a) AIC from alternative models, b) lambda from alternative models

4.3.1 Introduction to GWR – general rules, stationarity, kernels

This code presents the estimation of GWR (Geographically Weighted Regression) models using the GWmodel:: package. The package still uses sp class objects for point data, therefore one should create an object of such a class with the coordinates() command (not typical syntax). Estimating the GWR requires calculating the distance matrix between observations (with the gw.dist() command), estimating the proper bandwidth with bw.gwr() and obtaining the final model with gwr.basic(). Note that finding the bandwidth is done iteratively – by estimating alternative models with a wider or narrower radius (bandwidth) to minimise the cross-validation (CV) statistics. This can take some time, as it usually requires several (about 15-20) iterations – each iteration involves running an individual OLS subsample model for each observation (e.g. for 1000 observations, the bandwidth search requires about 15,000-20,000 regressions). The code also presents the GWR results graphically (Fig.73a) – as a map of the individual coefficients using plot() and findInterval(), and as a boxplot of coefficients by variable using boxplot() (Fig.73b).
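
Under the Gaussian kernel chosen below, each local regression weights the observations by their distance to the regression point; a short sketch of this weighting scheme (illustrative values only):

d<-seq(0, 2, by=0.5)         # distances to the regression point
bw<-1                        # hypothetical bandwidth
round(exp(-0.5*(d/bw)^2), 3) # Gaussian weights decay with distance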

# Fig.73: GWR estimates: a) mapping of GWR coefficient, b) boxplots of normalized coefficients

# Equation to estimate
eq<-locAggTOTAL ~ locHightech + locBIG + locPdens +  COREfirms + dist_core_10 + dist_midsize_10 + dist_regional_10 + dist_localbig_10 + dist_localsmall_10 

# sampling to get subset – by randomly selecting rows 
nn<-5000                        # size of subsample
selector<-sample(1:dim(firms)[1], nn, replace=FALSE) # selected rows
sub.firms<-firms[selector, ]        # subsample of firms in df
sub.firms.sf<-firms.sf[selector, ]  # subsample of firms in sf

# Estimate the GWR model
install.packages("GWmodel")
library(GWmodel)        # for GWR models
library(sp)     # for sp class
sub.firms.sp<-sub.firms # preparing dataset in sp class
coordinates(sub.firms.sp)<-c("lon", "lat")  # dataset in sp class

dMat<-gw.dist(cbind(sub.firms$lon, sub.firms$lat)) # distance matrix
bw<-GWmodel::bw.gwr(eq, data=sub.firms.sp, kernel='gaussian', adaptive=FALSE, dMat=dMat)    # optimised bandwidth
model.gwr<-GWmodel::gwr.basic(eq, data=sub.firms.sp, kernel='gaussian', adaptive=FALSE, bw=bw, dMat=dMat)  # GWR model

head(model.gwr$SDF)     # GWR coefficients – first few obs. & variables
# Map GWR coefficients for a selected variable - Fig.73a
variable<-as.data.frame(model.gwr$SDF[,2])[,1]
cols<-rainbow(10)
stat<-summary(variable)
brks<-c(stat[1], stat[2], stat[3], stat[5], stat[6]) # from summary

par(mar=c(1,2,2,2))                 # narrow margins
plot(st_geometry(region), border="grey50")  # plot contour
points(sub.firms[, 9:10], col=cols[findInterval(variable, brks)], pch=16)   # plot points, 9:10 are columns with xy coordinates
legend("topright", legend=paste("from ",round(brks,3)[1:4], "to ", round(brks,3)[2:5]), fill=cols, cex=0.8, bty="n")
title(main="GWR coefficients of selected variable by quartiles")
# Boxplot statistics of normalised GWR coefficients - Fig.73b
library(clusterSim)         # for quick normalisation
par(mar=c(3,3,3,3))         # wider margins
coefs<-as.data.frame(model.gwr$SDF[,1:10])[,1:10]
boxplot(data.Normalization(coefs, type="n1"), horizontal=TRUE, ylab=" ", axes=FALSE, ylim=c(-40, 20))   # with clusterSim::
axis(1)                         # axis
text(rep(-30, times=10), 1:10, names(model.gwr$SDF[,1:10]))
title(main="Normalised GWR coefficients")   # title
abline(v=0, lty=3, col="grey80")            # vertical line

Figure 73: Visualisation of the GWR model: a) map of GWR coefficients for a selected variable, b) boxplot of normalized coefficients

The second part of the code shows how to check which variables improve the quality of the model in terms of AIC. The search algorithm (using gwr.model.selection()) estimates the models in an iterative process, starting with a single variable and then adding further ones. The models are sorted by AIC (in decreasing order) and plotted on a spiral shell-like diagram (using gwr.model.view()) – the last model is the best model (Fig.74a). One can also examine the AIC of the AIC-sorted models (using model.sort.gwr()) to check how much the quality increases with each variable. Again, the last models, with the lowest AIC, are the best ones (Fig.74b).

# Fig.74: Selecting variables for the GWR model

library(GWmodel)
nn<-5000            # subset of data for faster computations
selector<-sample(1:dim(firms)[1], nn, replace=FALSE)
firms.sub<-firms[selector, ]

firms.sp<-firms.sub # transform data to sp class
coordinates(firms.sp)<-c("lon", "lat") # SpatialPointsDataFrame

DM<-gw.dist(as.matrix(firms.sub[,c("lon", "lat")]))# distance matrix
DepVar<-"locAggTOTAL"   # name of dependent variable
ExpVars<-c("locHightech","locBIG","locPdens","COREfirms", "dist_core_10", "dist_midsize_10", "dist_regional_10", "dist_localbig_10", "dist_localsmall_10") # explanatory variables

# estimation of models with combinations of explanatory variables
model.sel<-gwr.model.selection(DepVar, ExpVars, data=firms.sp, kernel="gaussian", dMat=DM, bw=0.5) 
model.list<-model.sel[[1]]
# Fig.74a  – models sorted by AIC (the last model with min AIC)
gwr.model.view(DepVar, ExpVars, model.list=model.list)
# Fig.74b – AIC of models sorted by AIC in decreasing order
# Sort models by AIC for plotting
sorted.models<-model.sort.gwr(model.sel, numVars=length(ExpVars), ruler.vector=model.sel[[2]][,2]) 

plot(sorted.models[[2]][,2], col="black", pch=20, lty=5, main="AIC for GWR models", ylab="AICc", xlab="Model number", type="b")

Figure 74: Selection of variables for GWR models based on AIC: a) spiral shell-like diagram, b) AIC in consecutive models

4.3.2 Spatial drift

This R code is a follow-up to the estimation in the previous code (Fig.73). In its first part (Fig.75), it clusters the normalised GWR coefficients with the k-means clustering algorithm using kmeans() from the stats:: package and plots the clusters. The second part of the code (Tab.18) estimates the spatial drift model – a global spatial econometric model (using the errorsarlm() command from the spatialreg:: package) which includes dummy variables (created with rep() and conditional operations) representing the k-means clusters from the code in Fig.75. The spatial weight matrix is optimised using the bestW() command (the approach from Fig.71). Models with and without spatial regime dummies are summarised and compared using the screenreg() command from the texreg:: package.

# Fig.75: Clustering of GWR coefficients (continued from Fig.73)

# clusters of coefficients – we use k-means algorithm
nc<-5           # number of clusters
data<-as.data.frame(model.gwr$SDF[,1:10])[,1:10]    # from Fig.73 code
crds<-as.data.frame(model.gwr$SDF[,1:10])[,11:12]

# k-means clustering of normalized GWR coefficients
c.coef<-kmeans(data.Normalization(data, type="n1"), nc)

# Fig.75a - Silhouette statistic to check the quality of the clustering 
install.packages("factoextra")
library(factoextra)
fviz_nbclust(data.Normalization(data, type="n1"), FUNcluster=kmeans)

# Fig.75b - Plot clusters of coefficients
clust<-c.coef$cluster       # variable to visualise
cols<-rainbow(nc)       # colour palette    
brks<-c(1,2,3,4,5)      # intervals for cluster values
plot(st_geometry(region), main="K-means clusters of GWR coefficients")          # plot contour of region
points(crds, pch=21, bg=cols[findInterval(clust, brks)]) # points
legend("bottomleft", legend=1:nc, pt.bg=cols, bty="n", pch=21, cex=0.99)

Figure 75: K-means clustering of GWR normalised coefficients: a) optimal number of clusters based on silhouette statistics, b) spatial distribution of k-means clusters

# Fig.76: Clustering of GWR coefficients (continued from Fig.73 & Fig.75)

library(factoextra)
library(clusterSim)
clust2<-eclust(data.Normalization(data, type="n1"), "kmeans", k=5) 
fviz_silhouette(clust2) # silhouette plot based on ggplot::

Figure 76: Quality of division into k-means clusters: a) average silhouette and size of each cluster, b) individual silhouette plot

# Tab.18: Global spatial models: a) without GWR clusters, b) with GWR clusters (spatial drift model) 

# Dummy variables to express spatial drift
# Each dummy represents a one cluster of GWR coefficients
# Global model includes k-1 clusters (to avoid multicollinearity)

sub.firms$clust1<-rep(0, times=dim(sub.firms)[1])
sub.firms$clust1[clust==1]<-1
sub.firms$clust2<-rep(0, times=dim(sub.firms)[1])
sub.firms$clust2[clust==2]<-1
sub.firms$clust3<-rep(0, times=dim(sub.firms)[1])
sub.firms$clust3[clust==3]<-1
sub.firms$clust4<-rep(0, times=dim(sub.firms)[1])
sub.firms$clust4[clust==4]<-1
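
The four dummy blocks above can also be written more compactly; an equivalent sketch (cluster 5 serves as the reference category):

for(k in 1:4){sub.firms[[paste0("clust", k)]]<-as.integer(clust==k)}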

# Equations to compare
# Previous equation, used in GWR estimation
eq1<-locAggTOTAL ~ locHightech + locBIG + locPdens +  COREfirms + dist_core_10 + dist_midsize_10 + dist_regional_10 + dist_localbig_10 + dist_localsmall_10

# New equation including clusters of GWR coefficients
eq2<-locAggTOTAL ~ locHightech + locBIG + locPdens +  COREfirms + dist_core_10 + dist_midsize_10 + dist_regional_10 + dist_localbig_10 + dist_localsmall_10+clust1+clust2+clust3+clust4

# Spatial error model
# checking the best number of knn to include in spatial model
best.W<-spatialWarsaw::bestW(sub.firms.sf, eq1, model_type="SEM", 1000, knn=c(2,5,10,15,20))

# Spatial weight matrix for best knn
bestW<-nb2listw(make.sym.nb(knn2nb(knearneigh(as.matrix(sub.firms[,9:10]), k=5))))

# Model without spatial drift
model.sem1<-errorsarlm(eq1, data=sub.firms, bestW)
summary(model.sem1)

# Model with spatial drift
model.sem2<-errorsarlm(eq2, data=sub.firms, bestW)
summary(model.sem2)

# Table comparing regression models
install.packages("texreg") # for quick comparison of models
library(texreg)
screenreg(list(model.sem1, model.sem2)) # summary of models

Table 18: Global spatial models: a) Model 1 - without GWR clusters, b) Model 2 - with GWR clusters (spatial drift model)

4.3.3 Temporal GWR for spatio-temporal stability

The code in Fig.78 presents a step-by-step analysis that compares the spatio-temporal stability of the GWR coefficients. It starts by ‘jittering’ the locations to avoid overlaps, creates an artificial variable ‘year’ using sample() to split the dataset by time, and stores the object in sf class. It also prepares the sp class object for GWR estimation and a raster using rast() from terra:: to split the observations (and GWR estimates) into raster cells. In a for() loop it estimates the GWR models for each year, clusters the GWR coefficients using the k-means algorithm and rasterises the k-means ID (cluster number) by taking the median value within each cell. Finally, it plots the raster with the median cluster ID as the variable in colour.
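
The loop below stores per-year results under constructed names with assign(); they are retrieved later with get(). In isolation the trick looks like this (hypothetical object name):

assign(paste0("result.", 2018), 1:3) # creates an object named result.2018
get("result.2018")                   # retrieves it by the constructed name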

# Fig.78: Rasterised GWR coefficients and comparisons

library(terra)      # for rast() & rasterize()
library(GWmodel)        # for GWR models
library(sp)     # for sp class

nn<-10000       # size of subsample
selector<-sample(1:dim(firms)[1], nn, replace=FALSE)
epsx<-rnorm(nn, 0, 0.0001)  # epsilon for x coordinates
epsy<-rnorm(nn, 0, 0.0001)  # epsilon for y coordinates
sub.firms<-firms[selector, ]    # subset
sub.firms$lon<-sub.firms$lon+epsx   # corrected locations
sub.firms$lat<-sub.firms$lat+epsy   # corrected locations
sub.firms$year<-sample(2018:2023, size=nn, replace=TRUE) # year
sub.firms.crds<-as.matrix(sub.firms[, c("lon", "lat")])  # crds 
sub.firms.sf<-st_as_sf(sub.firms, coords=c("lon", "lat"), crs=4326) 

# equation to estimate
eq<-locAggTOTAL ~ locHightech + locBIG + locPdens +  COREfirms + dist_core_10 + dist_midsize_10 + dist_regional_10 + dist_localbig_10 + dist_localsmall_10 

# customize dataset objects – convert to sp class
data.sp<-sub.firms  # copy-paste an object
coordinates(data.sp)<-c("lon", "lat")   # convert to sp class

# prepare a raster
bb<-st_bbox(region)         # bounding box
nrows.raster<-50            # number of rows in the raster
ncols.raster<-50            # number of columns in the raster
r<-terra::rast(nrows=nrows.raster, ncols=ncols.raster, xmin=bb[1], ymin=bb[2], xmax=bb[3], ymax=bb[4])  # raster from terra::

# GWR estimation by years – in a loop for years
# first it goes for one year to fix bandwidth (bw)
# then it uses this bandwidth for other years

time<-sub.firms$year                # variable of time
timeslot<-sort(unique(time))        # years for the loop

# first run to get bandwidth – for the first year
datax.df<-sub.firms[time==timeslot[1],]     # select timeslot
datax.sp<-data.sp[time==timeslot[1],]   # select timeslot
crds<-coordinates(datax.sp)         # crds in sp class
dMat<-gw.dist(crds)                 # distance matrix
bw<-GWmodel::bw.gwr(eq, data=datax.sp, kernel='gaussian', adaptive=FALSE, dMat=dMat)                # get bandwidth

for(i in 1:length(timeslot)){           # loop for each year
datax.df<-sub.firms[time==timeslot[i],]     # timeslot selection
datax.sp<-data.sp[time==timeslot[i],]   # timeslot selection
crds<-coordinates(datax.sp)         # crds in sp class
dMat<-gw.dist(crds)                 # distance matrix
modelGWR<-GWmodel::gwr.basic(eq, data=datax.sp, kernel='gaussian', adaptive=FALSE, bw=bw, dMat=dMat)        # GWR model

# Select data for clustering and cluster them
end<-which(names(modelGWR$SDF)=="y")-1 # number of columns with GWR
nc<-5                        # number of clusters
km.clust<-kmeans(as.data.frame(modelGWR$SDF[,1:end]), nc) 
raster<-rasterize(crds, r, values=km.clust$cluster, fun=median) # median cluster ID per cell

# Save objects to memory with timeslot in object name
# paste0() merges text and year; assign() gives this name to object
assign(paste0("modelGWR.", timeslot[i]), modelGWR) 
assign(paste0("km.clust.", timeslot[i]), km.clust)
assign(paste0("rast.", timeslot[i]), raster)
}       # end of the loop

# Plot rasters with median cluster ID in raster
# Median / mode value for k-means clusters for GWR coefficients
library(viridis)        # colour palette
viri_cols<-viridis(n=5, option="F")

# Fig.78a
plot(rast.2018, col=rev(viri_cols), colNA="white", main="Clusters for a model for a year 2018")
plot(st_geometry(region), add=TRUE)

# Fig.78b
plot(rast.2022, col=rev(viri_cols), colNA="white", main="Clusters for a model for a year 2022")
plot(st_geometry(region), add=TRUE)

Figure 78: Rasters with a dominant ID of clusters of GWR coefficients for selected years: a) for 2018, b) for 2022

The second part of the code (Fig.79) displays a boxplot of the coefficients over time and calculates the Rand Index for pairs of years, comparing the rasters of median cluster IDs. The code presents the step-by-step solution using the fossil::adj.rand.index() command. There is also an automated solution that covers all the steps in Fig.78 and Fig.79 using sts() from the spatialWarsaw:: package.
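
The (adjusted) Rand Index compares two partitions of the same objects: values close to 1 indicate nearly identical groupings, values close to 0 unrelated ones. A tiny illustration with made-up cluster labels:

library(fossil)   # for adj.rand.index()
a<-c(1,1,2,2,3,3) # hypothetical clustering in year 1
b<-c(1,1,2,3,3,3) # hypothetical clustering in year 2
adj.rand.index(a, b) # close to 1 = very similar partitions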

# Fig.79a:  Boxplot of coefficients over time 
# for a selected variable (in [])
# data.frame for the 1st year with coefficients and year label
coef.over.time<-data.frame(year=2018, var=(modelGWR.2018$SDF[,5]))
# loop that rbinds the data for the next years
for(i in 2019:2023){
a<-data.frame(year=i, var=(get(paste0("modelGWR.", i))$SDF[,5])) # model for year i
coef.over.time<-rbind(coef.over.time, a)}
# change for universal names
colnames(coef.over.time)<-c("year", "variable", "lon", "lat", "opt")
boxplot(coef.over.time$variable~coef.over.time$year) # boxplot
title(main="Variability of GWR coefficients over time")
# Fig.79b: Rand Index for clusters of rasterised GWR coefficients
# analysing spatio-temporal stability (STS) (continued from Fig.78)

# Automated solution that plots the matrix of Rand Index
# RI compares results of Fig.78 in pairs of years
library(spatialWarsaw)
my.sts<-sts(sub.firms.sf, eq, "year", region, nc=5, bw=0.05)
my.sts

# Manual solution – code continued from Fig.78
library(fossil)         # for Rand Index
library(plot.matrix)        # for plot.matrix()
library(RColorBrewer)       # for colour palette
colours<-brewer.pal(8, "BuPu") # colours from blue to purple

# names of objects for each year – use the ‘rast.’ objects from Fig.78
vec.coef<-paste0("rast.", timeslot)

# Object for Rand Index
tab.ri.coef<-matrix(0, nrow=length(timeslot), ncol=length(timeslot))

# RI for GWR coefficients
for(i in 1:length(timeslot)){   # start of the double loop
for(j in 1:length(timeslot)){
rix<-fossil::adj.rand.index(as.vector(get(vec.coef[i])), as.vector(get(vec.coef[j]))) # Rand Index, check pairs of vectors
tab.ri.coef[i,j]<-rix}  # save the results to matrix
}                   # end of the loop

print(tab.ri.coef[1:6, 1:6]) # RI matrix, first 6 rows and columns
chisq.test(tab.ri.coef) #Chi2 test: H0 independent, H1 dependent

# plot matrix with Rand Index using plot.matrix:: 
diag(tab.ri.coef)<-NA   # for better plotting
par(mar=c(6,6,6,6))     # wide margins
colnames(tab.ri.coef)<-min(timeslot):max(timeslot)
rownames(tab.ri.coef)<-min(timeslot):max(timeslot)

plot(tab.ri.coef, col=colours, main="Rand Index for clusters of GWR coefficients", xlab=" ", ylab=" ", cex.axis=0.8) # plot.matrix::

Figure 79: Comparison of GWR coefficients over time: a) boxplot of local coefficients of selected variable; b) Rand Index for clusters of all coefficients

4.3.4 GWR with Random Forest

This code (for Fig.80) presents an experimental approach to estimating GWR (Geographically Weighted Regression) with Random Forest (RF). As GWR is about running local regressions, in any approach one needs to identify the points that lie nearby. To define which points to include in local subsamples, one can use (at least) two approaches: a fixed number of neighbours, using kNN() from dbscan::; or neighbours within a fixed radius, using frNN() from dbscan::. Either approach selects the observations on which to run the RF model. Next, the RF models are run using randomForest() from the randomForest:: package in a for() loop, for each observation, and the individual results are saved to objects. Finally, one needs to summarise the individual results. In the second part (Fig.81), the code shows how to visualise a decision tree, which is similar to the trees grown in random forest estimation.

# Fig.80: Boxplots of variable importance for GWR RF

install.packages("randomForest")
library(randomForest)       # for Random Forest
library(dbscan)         # to find neighbours around

nn<-500                         # size of subsample
selector<-sample(1:dim(firms)[1], nn, replace=FALSE) # selected rows
sub.firms<-firms[selector, ]        # subsample in class data.frame 
sub.firms.sf<-firms.sf[selector, ]  # subsample in class sf

# find nearest neighbours – fixed knn
sub.firms.crds<-sub.firms[, 9:10]   # xy coordinates only
aaa<-dbscan::kNN(sub.firms.crds, 10)    # k=10 nearest neighbours 
aaa$id[1,]          # list of 10 nearest neighbours for obs no 1

# alternative - find nearest neigbours – fixed radius
#sub.firms.crds<-sub.firms[, 9:10]  # xy coordinates only
#aaa<-frNN(sub.firms.crds, 0.1)     # radius of about 10 km
#aaa$id[[1]]    # list of points in radius of max.0.1 degree

num.of.obs<-500         # small subsample for faster analysis
num.of.expl.var<-9      # number of explanatory variables in eq

eq<-locAggTOTAL ~ locHightech + locBIG + locPdens +  COREfirms + dist_core_10 + dist_midsize_10 + dist_regional_10 + dist_localbig_10 + dist_localsmall_10 

# This is like a trial run – to get the output structure with names
sub.x<-sub.firms[1:10, c("locHightech", "locBIG", "locPdens", "COREfirms", "dist_core_10", "dist_midsize_10", "dist_regional_10", "dist_localbig_10",  "dist_localsmall_10")]   # set of x variables
sub.y<-sub.firms[1:10, c("locAggTOTAL")]        # y variable
rf.global<-randomForest(sub.x, sub.y, importance=TRUE) # RF model

# Prepare the objects for final outcomes
IncMSE<-matrix(0, nrow=num.of.obs, ncol=num.of.expl.var)
IncNodePurity<-matrix(0, nrow=num.of.obs, ncol=num.of.expl.var)
metrics<-matrix(0, nrow=num.of.obs, ncol=2)
colnames(metrics)<-c("R2", "MSE")

# loop – for each observation we run the RF model
for(i in 1:num.of.obs){ # loop for each obs
nn<-aaa$id[i,]          # get IDs of knn for given obs. 
#nn<-aaa$id[[i]]        # alternative – neighbours for fixed radius
subset<-sub.firms[nn,]  # subset with local neighbours

sub.x<-subset[, c("locHightech", "locBIG", "locPdens", "COREfirms", "dist_core_10", "dist_midsize_10", "dist_regional_10", "dist_localbig_10",  "dist_localsmall_10")]  # set of x variables
sub.y<-subset[, c("locAggTOTAL")]           # set of y variables

rf.global<-randomForest(sub.x, sub.y, importance=TRUE) # RF model
IncMSE[i,]<-t(rf.global$importance[,1])     # save importance
IncNodePurity[i,]<-t(rf.global$importance[,2])  # save importance
metrics[i,1]<-mean(rf.global$rsq)           # save R2
metrics[i,2]<-mean(rf.global$mse)           # save MSE
}                                   # end of the loop

# one can add names of objects from output objects
colnames(IncMSE)<-rownames(rf.global$importance)
colnames(IncNodePurity)<-rownames(rf.global$importance)

# plot variable importance
par(bg="cornsilk")      # background colour

# Plot the MSE splitting rule, Fig.80a
boxplot(IncMSE/100000, horizontal=TRUE, main="Split rule: MSE", ylim=c(0, 50), axes=FALSE, col="cornflowerblue")
text(45, 1:9, colnames(IncMSE))
axis(1)

# Plot the Gini Impurity splitting rule, Fig.80b
boxplot(IncNodePurity/100000000, horizontal=TRUE, main="Split rule: Gini Impurity", ylim=c(0, 50), axes=FALSE, col="darkseagreen3")
text(45, 1:9, colnames(IncMSE))
axis(1)

Figure 80: Boxplots of variable importance for GWRF: a) Mean Squared Error (values divided by 100 000); b) Gini impurity (values divided by 100 million)

# Fig.81: Visualization of random tree (based on Fig.80)

install.packages("rpart")       # for rpart()
install.packages("rpart.plot")  # for prp()

library(rpart) 
library(rpart.plot) 

cart<-rpart(eq, data=sub.firms, cp=0.001) # recursive partitioning
prp(cart, type=4, extra=1) # plot rpart model

Figure 81: Decision tree for a global model: local agglomeration explained by the local agglomeration of high-tech firms (locHight), big firms (locBIG) and population (locPdens)

4.4 Bootstrapped model with Voronoi polygons and density-calibrated space

The bootstrapped spatial econometric model is an advanced algorithm that produces results that are simple and easy to interpret. Many intermediate steps are necessary to obtain the solution, therefore only the automated version using the BootSpatReg() command from the spatialWarsaw:: package is presented – in the first part of the code (Fig.82). Based on these results, there is code to plot the regression coefficients, normalised using data.Normalization() (Fig.82a). This part also includes the ApproxSERoot2() command from spatialWarsaw::, which recalculates the standard errors of the estimates from small to large samples according to the √2 rule (Fig.82b). It also shows how to reduce the dimensions using MDS (smacof::mds()) and plot the competing models in 2D, together with the best model selected with PAM using factoextra::eclust(). The code also shows how to make predictions in spatial econometric models run on point data with SpatPredTess(): for in-sample data (Fig.82c) and out-of-sample data (Fig.83). These predictions require a spatial weight matrix, built here on the k nearest neighbours criterion using a few functions: nb2listw(), make.sym.nb(), knn2nb() and knearneigh() from the spdep:: package.
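
The √2 rule follows from standard errors shrinking with the square root of the sample size: doubling n divides the SE by √2. A back-of-the-envelope sketch (illustrative numbers only):

se.small<-0.20               # hypothetical SE estimated on n=1000
n.small<-1000; n.big<-2000
se.small/sqrt(n.big/n.small) # approx. SE expected for n=2000 (= SE/sqrt(2))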

# Fig.82: Bootstrapped regression

library(spatialWarsaw) # for automated bootstrapped regression model

# Equation to estimate
eq<-locAggTOTAL ~ locHightech + locBIG + locPdens +  COREfirms + dist_core_10 + dist_midsize_10 + dist_regional_10 + dist_localbig_10 + dist_localsmall_10 

# Estimation of the bootstrapped model with spatialWarsaw::
# Moderately demanding parameters: 50 iterations, 3000 observations
bsr<-BootSpatReg(firms.sf, 50, 3000, eq, "SEM", knn=5)
head(bsr)    # first 6 rows of the full output
head(rownames(bsr$data.best),10) #IDs of data used in the best model

# Tab.20 – the best bootstrapped model
summary(bsr$model.best) # summary with p-value of the best model

Table 20: Best model from bootstrap alternatives

# boxplot of normalised coefficients – Fig.82a
library(clusterSim)
coef.n<-data.Normalization(bsr$coef.boot, type="n1", normalization="column")
coef.n<-coef.n[,1:dim(coef.n)[2]]

par(bg="white")             # colour of background 
boxplot(coef.n, main="Bootstrapped coefficients (normalised)", horizontal=TRUE, ylim=c(-2.5,5), axes=FALSE)
text(3.5, 1:10, colnames(coef.n))   # labels of boxes
axis(1)                     # x axis
abline(v=c(-2:2), lty=3, col="grey80")  # vertical links
# approximation of standard errors depending on sample size
asr<-spatialWarsaw::ApproxSERoot2(bsr$model.best) # Fig.82b
asr
# Fig.82d – PAM-selected best model (in red) for the MDS visualisation
library(factoextra) # PAM clustering into one cluster
best.model<-eclust(coef.n, "pam", hc_metric="euclidean", k=1)
best.model$id.med   # ID of best model

# Percentage difference between observed and fitted values – Fig.82c
# data conversion
firms.sf.best<-firms.sf[rownames(bsr$data.best), ]

# Spatial weight matrix for prediction model
# symmetric k nearest neighbours (knn=5) matrix
library(spdep)
knn.sym.listw<-spdep::nb2listw(make.sym.nb(knn2nb(knearneigh(firms.sf.best, k=5))))

pred<-SpatPredTess(bsr$model.best, firms.sf.best, knn.sym.listw, firms.sf.best, region) # spatialWarsaw::

preds<-as.data.frame(pred$forecast.result)
preds$diff<-(preds[,1]-preds[,2])/preds[,2]
colnames(preds)<-c("predicted_y","real_y", "lon", "lat", "diff.sq", "diff")
preds.sf<-st_as_sf(preds, coords=c("lon", "lat"), crs=4326)

ggplot() + geom_sf(data=region) + geom_point(data=preds, alpha=0.5, aes(x=lon, y=lat, color=diff)) + scale_colour_viridis_c(option="C")
# MDS to reduce dimensions – for Fig.82d
library(smacof)                 # for mds()
boot.mds<-mds(dist(coef.n), ndim=2,  type="mspline") # smacof::
boot.mds$conf                   # coordinates of points 

#  plot of MDS representation of models with the best PAM model
par(mar=c(3,3,3,3))
plot(as.data.frame(boot.mds$conf), cex=1.5, pch=21, bg="grey50") # plot of models in 2D
points(as.data.frame(boot.mds$conf)[best.model$id.med,], pch=15, bg="grey20", cex=3)
title(main="MDS representation of bootstrapped coefficients
with the best model selected using PAM")
legend("bottomleft", c("bootstrapped models","best model"), pch=c(21,15), col=c("grey50", "grey20"), bty="n", pt.bg=c("grey50", "grey20"), pt.cex=c(1.5,3))

Figure 82: Bootstrapped regression: a) boxplots of bootstrapped (normalised) regression coefficients; b) approximation of SE of the regression coefficients in a larger sample, according to the √2 rule and SE multipliers, c) percentage difference between observed and predicted (fitted) values for in-sample observations, d) MDS representation of bootstrapped coefficients with the best model selected using PAM

# Fig.83: Tessellated representative subsample and location of new points (in red) (continued from Fig.82)
 
# Coordinates in sf class - observations only used in the best model
# Required for knn spatial weight matrix
crds.sf<-st_centroid(st_geometry(firms.sf[rownames(bsr$data.best),]))
     
# Spatial weight matrix for prediction model
# symmetric k nearest neighbours (knn=5) matrix
library(spdep)
knn.sym.listw<-nb2listw(make.sym.nb(knn2nb(knearneigh(crds.sf, k=5))))
     
# Predict using Voronoi tiles, uses the best model from Fig.82
pred<-SpatPredTess(bsr$model.best, firms.sf[rownames(bsr$data.best), ], knn.sym.listw, firms.sf[1:100,], region) # spatialWarsaw::
pred

Figure 83: Tessellated best subsample and location of new points (in red)

4.5 Econometric and machine learning models on gridded data

The code for Fig.84 has been divided into two parts: the first one (‘pre-code’) that explains how to cut and create subsets of spatial data and the second one (‘proper code’) which shows how to perform econometric and machine learning analyses. The pre-code shows how to limit the spatial objects by name included in the database (conditional operation) or by contour using the st_join() command. The proper code reads the limited shapefiles (contour, grid) and plots a few layers.

The code for Fig.84 shows how to work with the grid. The first issue is plotting. As the grid cells are small (1 km x 1 km), they are hard to see for larger territories such as a NUTS2 region. Therefore, the code shows how to cut out a grid for a small area (e.g. a municipality or city) to make it easier to plot – it uses a predefined contour and the st_join() command. Sorting out which grid cell belongs to a given area and which does not is done with the which() command, which lists the rows that satisfy a condition (e.g. equal the required value of a variable). Plotting with the ggplot() and geom_sf() commands is efficient for limited (small) datasets. The second issue is the aggregation of point data to the grid – to know which point belongs to which grid cell. The approach is to enumerate the grid cells (from 1 in the top left corner to the last value in the bottom right corner), to label each point with the grid cell it belongs to, and to aggregate the point data (and its associated information) by the grid ID using the aggregate() command. The result of the aggregation is a separate data.frame object that needs to be merged back, by grid ID, into the full grid in sf class – this requires a search by ‘key variables’ with the merge() command. To ensure that the sort order is correct, the sf object is re-sorted using the arrange() command from the dplyr:: package within a pipe %>% operation.
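
The aggregate-then-merge mechanism, reduced to a toy example (made-up IDs, not the book’s data):

pts<-data.frame(myID=c(1,1,2,2,2), ones=1)       # 5 points in 2 grid cells
agg<-aggregate(pts$ones, by=list(pts$myID), sum) # counts per cell
colnames(agg)<-c("myID", "agglom")
grid<-data.frame(myID=1:3)              # full grid; cell 3 has no points
merge(grid, agg, by="myID", all.x=TRUE) # NA for empty cells, set to 0 later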

# Fig.84: Gridded data: a) population grid for Warsaw (census, 2011); b) number of firms (agglomeration) 

#*** Pre-code *** (not necessary for replicating calculations)
# Read larger maps (original files) and select appropriate areas

# POW is map for poviats – NUTS4 / LAU1 division level
# it contains 380 areas
POW<-st_read("powiaty.shp", stringsAsFactors = FALSE) # with sf::
POW$jpt_nazwa_<-iconv(POW$jpt_nazwa_, "latin1", "UTF-8")
POW<-st_transform(POW, 4326) # projections: 4326=WGS84, 4267=NAD27 

# NTS2 is a map for voivodships – NUTS2 division level
# it contains 16 areas
NTS2<-st_read("wojewodztwa.shp", stringsAsFactors = FALSE)
NTS2$jpt_nazwa_<-iconv(NTS2$jpt_nazwa_, "latin1", "UTF-8")
NTS2<-st_transform(NTS2, 4326) # projections: 4326=WGS84, 4267=NAD27 
NTS2    # inspect the object – the region names are the elements of interest
# read grid with population data + change the projection
# it contains 315857 areas
POP<-st_read("PD_STAT_GRID_CELL_2011.shp", stringsAsFactors=FALSE)
POP<-st_transform(POP, 4326)
names(POP)
dim(POP)
#*** Cut elements from larger maps
# Single NUTS2 region
region<-NTS2[NTS2$jpt_nazwa_=="mazowieckie", ]  # selected region
names(region)
# Limit the grid to selected contour of NUTS2 region
# to get about 36K units from about 315K units
# as output one gets tags indicating whether grid cells belong to the
# contour or not – the databases are merged: each grid cell retains its
# original content and additionally gets information from the contour

temp1<-st_join(POP, region, join=st_intersects)
names(temp1) # in bold from POP, the rest from region object
dim(temp1)
unique(temp1$jpt_nazwa)
# cut the grid 
a<-which(temp1$jpt_nazwa_=="mazowieckie") # choose cells by tag
length(a)       # [1] 36473
POP.region<-POP[a,]
st_write(POP.region, "POP_region.shp")  # write a shp file
############################################################
# here starts the proper code for replication
############################################################
# read population grid (shapefile) into sf class
POP.region<-st_read("POP_region.shp", stringsAsFactors=FALSE)
POP.region<-st_transform(POP.region, 4326)

# The ‘starter’ section reads the NTS4 object – contours of poviats
# The ‘starter’ section also reads NTS4_data object – data for NTS4
summary(NTS4)       # summary of contour object
summary(NTS4_data)  # summary of data set

# integrate rownames of NTS4 and NTS4_data
NTS4_data$jpt_nazwa_<-NTS4$jpt_nazwa_ # copy-paste between objects
head(NTS4_data)

# plot NUTS4/LAU1 and NUTS2 maps to check their overlap
# one gets two overlaid contour maps
library(sp)         # for geographical axes (with degrees)
bb<-st_bbox(region)         # bounding box to know xy limits
par(mar=c(3,3,1,1))     # proper margins of plot
plot(st_geometry(NTS4), xlim=c(19,23.5), ylim=c(51,53.5))
plot(st_geometry(region), border="red", lwd=2, add=TRUE)
degAxis(1);degAxis(2)       # from sp::

# plot population grid with transparent grid borders
ggplot() + geom_sf(data=POP.region, aes(fill=TOT), color="transparent")
# Fig.84: Grids for Warsaw: population (from census) and firms (aggregated)
# Plot Warsaw contour – object created in ‘starter’ section
plot(st_geometry(Warszawa))

# Cut grid cells within the contour of Warsaw
temp2<-st_join(POP.region, Warszawa, join=st_intersects)
unique(temp2$jpt_nazwa)
a2<-which(temp2$jpt_nazwa_=="powiat Warszawa")
POP.Warszawa<-POP.region[a2,]

ggplot() + geom_sf(data=POP.Warszawa, aes(fill=TOT)) # Fig.84a
# Aggregate firms to grid cells – for whole region
# Trick is to give IDs to grid cells and then aggregate by those IDs
# The second trick is to label each firm as 1 and then sum those 1s
POP.region$myID<-1:dim(POP.region)[1]
firms.sf$ones<-rep(1, times=dim(firms.sf)[1])
temp3<-st_join(firms.sf, POP.region, join=st_intersects)
firms.agg<-aggregate(temp3$ones, by=list(temp3$myID), sum, na.rm=TRUE)
colnames(firms.agg)<-c("myID", "agglom")
head(firms.agg)

# Merge number of firms (firms.agg) to population grid (by myID)
# all grid cells are kept; firms do not appear in every grid cell
# unfortunately the sort option does not always work
POP.region<-merge(POP.region, firms.agg, by.x="myID", by.y="myID", all.x=TRUE, sort=FALSE)
a<-which(is.na(POP.region$agglom)==TRUE)
POP.region$agglom[a]<-0

# Sort back the sf object by myID (original order)
# this is necessary so that the previously created vector (a2), which
# indicates the rows within the contour of Warsaw, still matches
library(dplyr)  # for pipe %>% operations
POP.region <- POP.region %>% arrange(myID)

# Plot grid of Warsaw & number of firms aggregated from points
# Select city of Warsaw – as previously
# difference to previous operation is now it includes new variable
POP.Warszawa<-POP.region[a2,]
ggplot() + geom_sf(data=POP.Warszawa, aes(fill=agglom)) # Fig.84b

Figure 84: Gridded data for Warsaw: a) population; b) number of firms

The code for Tab.22 is a follow-up to the previous codes. First, it shows how to create some new variables. There are several approaches: it aggregates point data into grid cells (as in Fig.84); it calculates the distance between individual points – centroids of grid cells (found with st_coordinates() and st_centroid()) – and a selected point (e.g. a city centre) using the spDistsN1() command from the sp:: package; it performs arithmetic operations on existing variables (division of variables stored in the grid); and it decomposes regional data to the grid level (by assigning the same value to each grid cell within a given region) with st_join(). The grid dataset prepared in this way, with variables recoded from other granularity levels, can then be used to estimate OLS and spatial econometric models (using a spatial weight matrix W).

# Tab.22: Create new variables in a grid and run regressions (code continued from Fig.84)
# add new variables to the population grid 

# Number of high-tech firms for each cell, aggregated from points
hightech.agg<-aggregate(temp3$if_hightech, by=list(temp3$myID), sum, na.rm=TRUE)
colnames(hightech.agg)<-c("myID", "hightech.agg")
POP.region<-merge(POP.region, hightech.agg, by.x="myID", by.y="myID", all.x=TRUE, sort=FALSE)
POP.region <- POP.region %>% arrange(myID) # sort grid
a<-which(is.na(POP.region$hightech.agg)==TRUE) # rows with NA
POP.region$hightech.agg[a]<-0   # change NA to 0

# Dummy if given grid cell contains at least one high-tech firm
POP.region$if.hightech<-0
POP.region$if.hightech[POP.region$hightech.agg>0]<-1

# distance from each grid (centroid) to center of Warsaw
grids.xy<-st_coordinates(st_centroid(POP.region))
Warsaw.xy<-st_point(x=c(20.98, 52.21), dim="XY")
POP.region$dist<-sp::spDistsN1(grids.xy, Warsaw.xy, longlat=TRUE)

# labour force - working age population to the total population
POP.region$labour.force<-POP.region$TOT_15_64/POP.region$TOT
POP.region$labour.force[is.na(POP.region$labour.force)==TRUE]<-0
a<-which(is.infinite(POP.region$labour.force)==TRUE)
POP.region$labour.force[a]<-0

# Decompose unemployment rate from NTS4 data to grid cells
NTS4$unemp<-NTS4_data$unemploment_rate
ggplot() + geom_sf(data=NTS4, aes(fill=unemp)) # plot of variable

# Overlay the poviat map on the grid 
temp4<-st_join(POP.region, NTS4, join=st_covered_by)
POP.region$unemp<-temp4$unemp
a<-which(is.na(POP.region$unemp)==TRUE)     # search for NA
POP.region$unemp[a]<-0              # NA changed to 0
ggplot() + geom_sf(data=POP.region, aes(fill=unemp), color="transparent")               # plot of variable by grids
# Estimate the OLS model on the grid
eq<-agglom ~ TOT + dist + unemp + labour.force + hightech.agg + if.hightech # equation to estimate

model.ols<-lm(eq, data=POP.region)          # linear model
summary(model.ols)
# estimate spatial model on the grid
# contiguity matrix W for grid for spatial model
library(spdep)      # for poly2nb() and nb2listw()
cont.sf<-poly2nb(POP.region)            # class nb
cont.listw<-nb2listw(cont.sf, style="W")        # class listw
cont.listw                          # summary of W

library(spatialreg)
model.error<-errorsarlm(eq, data=POP.region, cont.listw,  zero.policy=TRUE, method="LU", tol.solve=1.0e-50) 
summary(model.error, Nagelkerke=TRUE)   # summary of model
# summarise both models in one table
library(texreg)
screenreg(list(model.ols, model.error), custom.model.names=c("model ols", "model spatial error")) # tabular summary of two models

Table 22: Estimation results for model on gridded data
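
To formally compare the two specifications, a likelihood-ratio test can be added (a minimal sketch, not part of the book's code; LR.Sarlm() from the spatialreg:: package tests the spatial error model against its OLS counterpart):

LR.Sarlm(model.error, model.ols)    # H0: spatial error parameter = 0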

The codes for Fig.85, Fig.86, Tab.23 and Tab.24 use the prepared grid with data for ANN (Artificial Neural Network) estimation. This requires an equation with a categorical variable (e.g. a dummy) as the dependent variable. ANN models are computationally intensive, so pilot studies should be run on subsamples (as in the example). The ANN uses the neuralnet() command from the neuralnet:: package – it produces a visualisation with plot() and objects containing weights etc. Predictions require the predict() function – real and predicted values can be tabulated using table(). As ANN predictions are formed as probabilities of the given labels, the which.max() function should be used to select the most likely label from the output. Prediction quality and variable importance are available from the caret:: and NeuralNetTools:: packages with the train() and garson() commands.
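
As neuralnet() can be sensitive to the scale of the inputs, a simple rescaling step may improve convergence. A minimal, optional sketch (the rng() helper and the object POP.region.scaled are assumptions, not part of the book's code):

# optional min-max rescaling of explanatory variables before ANN
vars<-c("agglom", "TOT", "dist", "unemp", "labour.force")
rng<-function(x) (x-min(x))/(max(x)-min(x))    # scale to [0,1]
POP.region.scaled<-POP.region
for(v in vars) POP.region.scaled[[v]]<-rng(POP.region.scaled[[v]])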

# Fig.85: Artificial Neural Network (ANN)

# Equation with dummy variable as dependent variable
eq<-if.hightech ~ agglom + TOT + dist + unemp + labour.force

# Subset from grid – to speed up computations
selector<-sample(1:dim(POP.region)[1], 2000, replace=FALSE)
POP.region.sub<-POP.region[selector, ]

# Artificial Neural Network
install.packages("neuralnet")
library(neuralnet)
model.nn<-neuralnet(eq, data=POP.region.sub, hidden=c(2,2), linear.output=FALSE)        # ANN with neuralnet::
model.nn                # results of estimation
model.nn$result.matrix  # weights and quality metrics
plot(model.nn, rep="best")  # visualisation, Fig.85

Figure 85: Graph of the Artificial Neural Network for the analysed equation

# Tab.23: Predictions from the ANN model (continued from Fig.85)

# predictions
pred<-predict(model.nn, as.data.frame(POP.region.sub))
# confusion matrix (actual vs. predicted data)
table(POP.region.sub$if.hightech, apply(pred, 1, which.max))

Table 23: Confusion matrix for ANN model
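
Since the network above has a single 0/1 output neuron, an alternative is to threshold the predicted probability at 0.5 (a sketch, assuming pred from the previous code; the 0.5 cut-off is an assumption):

pred.label<-as.numeric(pred[,1]>0.5)    # label 1 if predicted prob > 0.5
tab<-table(POP.region.sub$if.hightech, pred.label)
tab                     # confusion matrix with thresholded labels
sum(diag(tab))/sum(tab) # overall accuracy (valid if tab is square)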

# Tab.24 & Fig.86: Fine-tuning of the ANN model

install.packages("caret")
install.packages("NeuralNetTools")
library(caret)
library(NeuralNetTools)
model.caret<-caret::train(eq, method='nnet', data=POP.region, linout=TRUE)
NeuralNetTools::garson(model.caret) # variable importance, Fig.86
model.caret                 # Tab.24
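# the tuned hyper-parameters can also be extracted directly (a sketch):
model.caret$bestTune    # best (size, decay) combination chosen by caret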

Table 24: Fine-tuning of parameters in ANN

Figure 86: Variable importance in the ANN model

4.6 Elasticity model for local density

The first part of the code (Fig.87) creates logarithms of the data using the log() command, plots their statistical densities using density() and boxplots using boxplot(). The second part of the code (Fig.87, Tab.25) estimates a global OLS elasticity model and local GWR elasticity models to examine the spatial relationship between population density and firm agglomeration. OLS is estimated with the lm() command, while GWR, as in the previous examples, uses the GWmodel:: package and the gw.dist(), bw.gwr() and gwr.basic() commands. The last part of the code displays the k-means clusters of the GWR coefficients interactively using mapview() and as a hexagonal plot.

# Fig.87: Relation between population density and agglomeration of firms

nn<-10000                       # size of subsample
selector<-sample(1:dim(firms)[1], nn, replace=FALSE) # selected rows
sub.firms<-firms[selector, ]        # subsample of firms in df

# Create logarithms of variables
sub.firms$locAggTOTAL.1<-sub.firms$locAggTOTAL+1
sub.firms$locPdens.1<-sub.firms$locPdens+1
sub.firms$log.locAggTOTAL.1<-log(sub.firms$locAggTOTAL.1)
sub.firms$log.locPdens.1<-log(sub.firms$locPdens.1)
# Statistical density of two variables – Fig.87a
plot(density(sub.firms$log.locPdens.1), xlim=c(0,10), xlab="values in log", ylab="", main="", ylim=c(0,0.35), axes=FALSE)
lines(density(sub.firms$log.locAggTOTAL.1), lwd=2)
axis(1); axis(2)
text(8,0.25, "agglomeration of firms")
text(4.5,0.325, "density of population")
# Boxplot – Fig.87b
boxplot(firms$locAggTOTAL~cut(firms$locPdens, breaks=0:16*25), xlab="Population density *100", ylab="Agglomeration in radius of 500 m", col="cornflowerblue")
abline(h=0:6*2000, lty=3, col="grey70")

Figure 87: Relationship between population density and business agglomeration: a) density distributions of logarithms of population density and business agglomeration; b) boxplots of number of firms in intervals of population density

# Tab.25 & Fig.88: Elasticity models – OLS and GWR (continued from Fig.87)

# Equation to estimate – on logarithms
eq<-log.locAggTOTAL.1~log.locPdens.1 + dist_core_10 + dist_core_25 + dist_midsize_10 + dist_midsize_25 + dist_regional_10 + dist_regional_25 + dist_localbig_10 + dist_localbig_25 + dist_localsmall_10 + dist_localsmall_25

model.ols<-lm(eq, data=sub.firms) # OLS model
summary(model.ols)
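
# interpretation (a sketch): in this log-log specification the slope on
# log.locPdens.1 is an elasticity – a 1% rise in population density is
# associated with approx. beta% change in firm agglomeration
coef(model.ols)["log.locPdens.1"]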

library(GWmodel) # for GWR model
# transformations of objects: subset, sp class
sub.firms.sp<-sub.firms
coordinates(sub.firms.sp)<-c("lon", "lat")
crds<-coordinates(sub.firms.sp)

# estimation of GWR with GWmodel::
dMat<-gw.dist(cbind(sub.firms$lon, sub.firms$lat))
bw<-GWmodel::bw.gwr(eq, data=sub.firms.sp, kernel='gaussian', adaptive=FALSE, dMat=dMat)
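# NOTE: the model below fixes the bandwidth manually (bw=0.15); the
# optimised value from bw.gwr() above could be passed instead (bw=bw)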
model.gwr<-GWmodel::gwr.basic(eq, data=sub.firms.sp, kernel='gaussian', adaptive=FALSE, bw=0.15, dMat=dMat)  
model.gwr
head(model.gwr$SDF)     # see the model coefficients

Table 25: Elasticity model with localised effects estimated with OLS and GWR

# map of GWR coefficients – Fig.88a
# elasticity coefficient coefs[,2] corrected with beta for dummies
coefs<-as.data.frame(model.gwr$SDF)
variable<-coefs[,2]+coefs[,3]*sub.firms[,"dist_core_10"]+coefs[,4]*sub.firms[,"dist_core_25"]+coefs[,5]*sub.firms[,"dist_midsize_10"]+coefs[,6]*sub.firms[,"dist_midsize_25"]+coefs[,7]*sub.firms[,"dist_regional_10"]+coefs[,8]*sub.firms[,"dist_regional_25"]+coefs[,9]*sub.firms[,"dist_localbig_10"]+coefs[,10]*sub.firms[,"dist_localbig_25"]+coefs[,11]*sub.firms[,"dist_localsmall_10"]+coefs[,12]*sub.firms[,"dist_localsmall_25"]

cols<-rainbow(10)
stat<-summary(variable)
brks<-c(stat[1], stat[2], stat[3], stat[5], stat[6])

par(mar=c(1,2,2,2))
plot(st_geometry(region), border="grey50")
points(crds, col=cols[findInterval(variable, brks)], pch=16)
legend("topright", legend=paste("from ",round(brks,3)[1:4], "to ", round(brks,3)[2:5]), fill=cols, cex=0.8, bty="n")
title(main=paste0("GWR coefficients of selected variable by quartiles"))
# boxplot of normalised GWR coefficients – Fig.88b
library(clusterSim)
coefs<-as.data.frame(model.gwr$SDF[,1:10])[,1:10]
par(mar=c(3,3,3,3))
boxplot(data.Normalization(coefs, type="n1"), horizontal=TRUE, ylab=" ", axes=FALSE, ylim=c(-40, 20))
axis(1)
text(rep(-30, times=10), 1:10, names(model.gwr$SDF[,1:10]))
title(main="Normalised GWR coefficients")
abline(v=0, lty=3, col="grey80")

Figure 88: Visualisation of GWR estimation: a) elasticity coefficient corrected with dummy coefficients; b) boxplots of normalized coefficients (with an interval of -3/+3 standard deviations)

# Fig.89: Interactive and hexagonal plots of k-means clusters

# interactive plot of k-means clusters of GWR coefficients Fig.89a
sub.firms.sf<-st_as_sf(sub.firms, coords=c("lon", "lat"), crs=4326)
km<-kmeans(as.data.frame(model.gwr$SDF[,1:10]), 10)
sub.firms.sf$cluster<-km$cluster
library(mapview)
mapview(sub.firms.sf, zcol="cluster")
# clusters aggregated to hexagons – Fig.89b
hex.sf<-st_make_grid(region, square=FALSE, n=c(50,40)) # sfc class
hex.sf<-st_as_sf(hex.sf)    # sf class

hex.sf$myID<-1:dim(hex.sf)[1]   # ID added to sf object
temp5<-st_join(sub.firms.sf, hex.sf, join=st_intersects) # spatial join
temp5$clust<-km$cluster # transfer cluster ID between objects

hex.agg<-aggregate(temp5$clust, by=list(temp5$myID), median, na.rm=TRUE)    # aggregate cluster ID within hexagons
colnames(hex.agg)<-c("myID", "clust")   # changing names of columns

hex.sf<-merge(hex.sf, hex.agg, by.x="myID", by.y="myID", all.x=TRUE, sort=FALSE) # merge aggregated clusters into hexagon grid
library(dplyr)
hex.sf <- hex.sf %>% arrange(myID) # sorting

ggplot(data=hex.sf) + geom_sf(aes(fill=clust)) + geom_sf(data=region, color="black", fill="transparent") + theme(legend.position="none") # plotting

Figure 89: K-means clusters of coefficients from the GWR model: a) plotted interactively; b) aggregated to hexagons
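
The number of clusters used above (k=10) is arbitrary; a quick elbow check can support the choice of k (a minimal sketch, with the range 2:15 and nstart=5 as assumptions, not part of the book's code):

wss<-sapply(2:15, function(k) kmeans(as.data.frame(model.gwr$SDF[,1:10]), centers=k, nstart=5)$tot.withinss)
plot(2:15, wss, type="b", xlab="k", ylab="total within-cluster SS")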

Extra codes for data processing

Point data where the location is given as a postal address can be geocoded in R using the tidygeocoder:: package. Below is an example of finding the (x,y) geo-coordinates for given addresses using OpenStreetMap (OSM). It constructs the input data using tribble() from the tibble:: package and geocodes it using the geocode() function from the tidygeocoder:: package. The result is plotted using the ggplot() command and labelled using the geom_label_repel() command from the ggrepel:: package.

# geo-coding of addresses
install.packages('tidygeocoder')
library(dplyr)
library(tidygeocoder)

locations <- tibble::tribble(
~name,                  ~addr,
"Warsaw, Royal Castle",          "plac Zamkowy 4, 00-277 Warszawa, Polska",
"Chopin Airport",   "Żwirki i Wigury 1, 02-143 Warszawa, Polska",     
"Modlin Airport",   "Generała Wiktora Thommée, 05-102 Nowy Dwór Mazowiecki, Polska", 
"Chopin’s birthplace", "Żelazowa Wola 15, 96-503 Sochaczew, Polska")

# geocode the addresses
locations.xy <- locations %>%
  geocode(addr, method='osm', lat=latitude, long=longitude)
locations.xy

library(ggplot2)
library(ggrepel)        # for text in boxes
ggplot() + geom_sf(data=region) + geom_point(data=locations.xy, aes(longitude, latitude), color = "grey99") + ggrepel::geom_label_repel(aes(longitude, latitude, label=name), data=locations.xy) + theme_void()
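
tidygeocoder:: also supports the reverse operation – finding an address for given coordinates. A minimal sketch (assuming locations.xy from above; the output column name addr_found is an assumption):

# reverse geo-coding: coordinates -> address via OSM
locations.xy %>%
  tidygeocoder::reverse_geocode(lat=latitude, long=longitude, method='osm', address=addr_found)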