Reproducible codes for QDC (Quick Density Clustering)

This document shows how to reproduce the graphics and analysis that were presented in a paper on “Analysing local spatial density of human activity with quick density clustering (QDC) algorithm”

Kopczewska K. (2025), Analysing local spatial density of human activity with quick density clustering (QDC) algorithm, Computers, Environment and Urban Systems,

Reading the data for the analysis

Datasets for point data are available at Figshare, please download:

for population is at https://figshare.com/s/b0feffe666b4f4bece93?file=39484516 - these are the locations (x,y) of points (always needed)
for firms at https://figshare.com/articles/dataset/Firms_in_Mazovian_region_Poland_2012_/22215634?file=46207716 - these are the locations (x,y) of points (always needed)
for administrative contours (shapefiles) https://gis-support.pl/baza-wiedzy-2/dane-do-pobrania/granice-administracyjne/ - this is bounding box for points (nice to have)
for location of cities in Mazovian region: https://figshare.com/articles/dataset/Location_of_cities_in_Mazovian_region_Poland_/28509350 - these are the locations (x,y) of points (nice to have)

The code below reads data.

# read packages
library(sf)
library(terra)
library(ggplot2)
library(dbscan)
library(spatialWarsaw)
library(osmdata)
library(sp)

# set Working Directory 
setwd("C:/Users/data")

# read maps into sf class
WOJ<-st_read("wojewodztwa.shp") # read shapefile for 16 regions
WOJ<-st_transform(WOJ, 4326)    # convert to WGS84
region<-WOJ[WOJ$jpt_nazwa=="mazowieckie", ] # separate one region

POW<-st_read("powiaty.shp") # read shapefile with 380 subregions
POW<-st_transform(POW, 4326)    # convert to WGS84
Warszawa<-POW[POW$jpt_nazwa=="powiat Warszawa", ]#separate Warsaw

# read point data on firms, population and cities

firms<-read.table("subsample.txt")
firms.sf<-st_as_sf(firms, coords=c("lon", "lat"), crs=4326) 

popul<-read.csv("points_popul_maz.csv", header=TRUE, dec=".", sep=",")
popul.sf<-st_as_sf(popul, coords=c("x", "y"), crs=4326)

miasta<-read.csv("miasta_maz1.csv", header=TRUE, dec=",", sep=";")
miasta.sf<-st_as_sf(miasta, coords=c("coords.x", "coords.y"), crs=4326)

Concept of the algorithm

QDC uses intuitive features of spatial density: distance to other points and intensity of the surroundings, expressed as two spatial variables (Fig.1). These are two complementary spatial metrics.

Distance to other points is expressed as the sum of the distances to the k-nearest neighbours (kNN) - it provides an insight into the compactness of point clusters, where a lower sum indicates higher density. Obviously, the higher the local point density, the closer the kNNs and the lower the total distance. It can take values from almost 0, in the case of a very close neighbourhood, to an undefined value, depending on the spatial extent of the data and the scale of measurement.

The number k for social data should be around 15-30. The second spatial variable, neighbourhood intensity, is the number of neighbours within a fixed radius (frNN) - it captures local density.

Figure 1. Spatial variables to detect the local density of points

The QDC algorithm works with the two spatial variables discussed above. There are only a few steps in QDC to obtain the results: [1] Compute the spatial variables: for each point, compute the sum of the distances to its k-nearest neighbours (kNN) and the number of neighbours within a fixed radius (frNN); [2] Normalise spatial variables: apply z-score normalisation to both spatial variables; [3) Apply K-means clustering: cluster the normalised spatial variables into K density groups, preferably k=3, to obtain low, medium or high density clusters; [4) Classify points: assign points to low, medium or high density clusters based on their group assignment.

Working with simulated data

Generate the mixed point pattern

First, one needs a territory for the generated point patterns.

polygon<-st_sfc(st_polygon(list(cbind(c(0,1,1,0,0),c(0,0,1,1,0)))))
poly.sf<-st_as_sf(polygon, crs=4326) # sf class, projected
plot(st_geometry(polygon)); axis(1); axis(2)

Secondly, one needs points. This code generates the mixed point pattern composed of a random, regular and clustered point pattern. They are inside the previously generated bounding box.

# generate point-patterns
# number of obs. clustered/regular/random point-patterns
nn<-c(1650, 450, 700, 100)   

x<-rnorm(nn[1], 0.25, 0.1)       # clustered skewed pattern (1)
y<-rnorm(nn[1], 0.75, 0.1)
bb1<-data.frame(X=x, Y=y)

b2<-st_sample(poly.sf, nn[2], type="regular")  # regular pattern
bb2<-as.data.frame(st_coordinates(b2))

b3<-st_sample(poly.sf, nn[3], type="random")   # random pattern
bb3<-as.data.frame(st_coordinates(b3))

x<-rnorm(nn[4], 0.65, 0.05)       # clustered skewed pattern (2)
y<-rnorm(nn[4], 0.2, 0.05)
bb4<-data.frame(X=x, Y=y)

bb<-rbind(bb1, bb2, bb3, bb4)       # all points in one object
a<-which(bb[,1]<0 | bb[,2]>1 | bb[,2]<0)
bb<-bb[-a,]
head(bb)
dim(bb)

plot(bb[,1:2], pch=21, bg="black", cex=0.7, main="Points to divide by local density", xlab="X coordinates", ylab="Y coordinates")

Figure 2: Generated point pattern with different local density

Run QDC algorithm on generated data

This code prepares the data in sf class and runs QDC algorithm using the spatialWarsaw:: package

bb$ones<-rep(1, times=dim(bb)[1]) # add a vector of ones
bb.sf<-st_as_sf(bb, coords=c("X", "Y"), crs=4326) # make sf class from data.frame

points.qdc<-QDC(bb.sf, 5000, nclust=3, k=10, eps=0.05) # run QDC using spatialWarsaw:: package
head(points.qdc) # display the outcome of QDC

One can change the names of the columns to create nicer labels in the figure.

colnames(points.qdc)<-c("X", "Y", "knndist1", "frnn1", "Total_distance_to_k_nearest_neighbours", "Number_of_points_around", "Cluster_ID", "outcome.set1") # change the names of the output

ggplot(points.qdc, aes(x=Total_distance_to_k_nearest_neighbours, y=Number_of_points_around, color=Cluster_ID)) + geom_point() + annotate("text", x=1, y=2, label="high density
(x) (low x) – close to k nearest neighbours, 
(y) (high y) - many neighbours around", col="blue") + annotate("text", x=2, y=-0.4, label="low density
(x) (high x) – far to k nearest neighbours, 
(y) (low y) – only a few neighbours around", col="red") + annotate("text", x=0.5, y=0, label="medium density", col="darkgreen")

ggplot(points.qdc, aes(x=X, y=Y, color=Cluster_ID)) + geom_point()

Figure 3. Quick Density Clustering algorithm. (A) Mutual relationship between normalised spatial variables. (B) Geo-location of density clusters of points in Fig.2.

Sensitivity analysis

An important question is whether the parameters matter for the results. In particular, if setting a given size of neighbourhood (k) or radius (eps) significantly changes what we get. The analysis below shows the statistical densities of normalised spatial variables under different scenarios.

QDC with different parameters - look at k and eps.

#QDC with set1 of parameters
qdc1<-QDC(bb.sf, 2870, nclust=3, k=10, eps=0.01)
head(qdc1)

#QDC with set2 of parameters
qdc2<-QDC(bb.sf, 2870, nclust=3, k=100, eps=0.01)
head(qdc2)

# QDC with set3 of parameters
qdc3<-QDC(bb.sf, 2870, nclust=3, k=10, eps=0.1)
head(qdc3)

# QDC with set4 of parameters
qdc4<-QDC(bb.sf, 5000, nclust=3, k=100, eps=0.1)
head(qdc4)

# QDC with set5 of parameters
qdc5<-QDC(bb.sf, 5000, nclust=3, k=250, eps=0.1)
head(qdc5)

# statistical density plots
plot(density(qdc1[,5]), ylim=c(0,1), lwd=0, main="Density distribution 
of normalised spatial variables") 

lines(density(qdc2[,5]), lwd=1, lty=1)  # k=10
lines(density(qdc2[,5]), lwd=2, lty=1)  # k=100
lines(density(qdc5[,5]), lwd=3, lty=1)  # k=250

text(1.7,1, "Total distance to k nearest neighbours")
legend(0.5, 0.98, legend=c("knn=10", "knn=100", "knn=250"), lwd=c(1,2,3),  lty=c(1,1,1), bty="n")

lines(density(qdc1[,6]), lty=3, lwd=1)  # eps=0.01
lines(density(qdc3[,6]), lty=3, lwd=2)  # eps=0.1

text(1.7,0.75, "Number of neighbours within the radius")
legend(0.5, 0.72, legend=c("radius=0.01", "radius=0.1"), lwd=c(1,2),  lty=c(3,3), bty="n")

Figure 4: Statistical density distributions of normalised spatial variables for different parameter scenarios

The (in)sensitivity of the method to parameters can be checked by assessing the degree of agreement between partitions from different scenarios. A typical tool to check whether two cluster assignments are consistent for a given dataset is the Rand Index.

First, spatial variables are created for a set of scenarios and k-means clustering is performed. The procedure was done step by step because the qdc() command applies sampling and observations from different runs are not comparable.

param<-data.frame(ID=1:6, param.k=c(10, 100, 250, 10, 100, 250), param.eps=c(0.01, 0.01, 0.01, 0.1, 0.1, 0.1))

#  ID param.k param.eps
#1  1      10      0.01
#2  2     100      0.01
#3  3     250      0.01
#4  4      10      0.10
#5  5     100      0.10
#6  6     250      0.10

for(i in 1:6){  # iteration by rows

knn.dist<-kNNdist(bb[,1:2], k=param[i,2], all=TRUE)
knn.dist1<-apply(knn.dist, 1, sum)   
temp1<-data.Normalization(knn.dist1, type="n1", normalization="column")
bb<-cbind(bb, temp1)
colnames(bb)[dim(bb)[2]]<-paste0("sp", i, "a")

agg.rad<-frNN(as.matrix(bb[,1:2]), eps=param[i,3])
agg.rad1<-unlist(lapply(agg.rad$id, length))
temp2<-data.Normalization(agg.rad1, type="n1", normalization="column")
bb<-cbind(bb, temp2)
colnames(bb)[dim(bb)[2]]<-paste0("sp", i, "b")

temp3<-kmeans(bb[ ,c(paste0("sp", i, "a"), paste0("sp", i, "b"))], 3)$cluster # 3 clusters
bb<-cbind(bb, temp3)
colnames(bb)[dim(bb)[2]]<-paste0("km", i)
}

Secondly, a Rand Index is calculated for all pairs of divisions (e.g. km1 and km2; km1 and km3, etc.).

# Rand Index
library(clv)

RI<-matrix(0, nrow=6, ncol=6) # place to put the output

# one will get clust_1_2, clust_1_2, clust_1_3, clust_1_4, etc…
# and rand_1_1, rand_1_2, rand_1_3, …
for(i in 1:6){
for(j in 1:6){
temp4<-std.ext(bb[,paste0("km", i)], bb[,paste0("km", j)])
assign(paste0("clust_",i,"_",j),temp4)
temp5<-clv.Rand(get(paste0("clust_",i,"_",j)))
assign(paste0("rand_",i,"_",j),temp5)
RI[i,j]<-temp5
}}
RI

Table 1: Rand Index for alternative scenarios of QDC parameters

One can also plot the matrix.

diag(RI)<-NA # diagonal with NA for better plotting
colnames(RI)<-c("k=10 eps=0.01", "k=100 eps=0.01", "k=250 eps=0.01", "k=10 eps=0.1", "k=100 eps=0.1", "k=250 eps=0.1") 
rownames(RI)<-c("k=10 eps=0.01", "k=100 eps=0.01", "k=250 eps=0.01", "k=10 eps=0.1", "k=100 eps=0.1", "k=250 eps=0.1") 

library(plot.matrix)        # for plotting coloured matrix
library(RColorBrewer)       # for colour palette
p6<-brewer.pal(n=6, name="YlGn")    # 6 colours, yellow to green
par(mar=c(3,3,3,4))         # wide margins

plot(RI, col=p6, na.col="white", main="Rand Index for alternative scenarios of QDC parameters", xlab="", ylab="", cex.axis=0.7)

Outcomes from alternative solutions

Alternative measures do not provide the density clusters as offered by QDC.

DBSCAN (Fig.4a) finds many clusters within the general high density cluster, leaving medium and low density points as unclustered noise. The separated clusters are not necessarily of different local density as the main criterion for separation is spatial discontinuity.

The KDE algorithm (Fig. 4b) produces the density surface - it detects locations of homogeneous local density well, but it does not assign points to specific clusters. The result is KDE values in particular pixels of the figure (128 x 128) and the classification for raw spatial points is not reported.

# DBSCAN plot
library(dbscan)
db<-dbscan(bb[, 1:2], eps=0.01, minPts=10)
plot(bb[, 1:2], col=db$cluster+1, pch=20)
title(main="DBSCAN (eps=0.01, minPts=10, n=2900)")

# KDE plot
ggplot() + geom_sf(data=poly.sf) + stat_density2d(aes(x=X, y=Y, fill=after_stat(level), alpha=0.25), linewidth=0.01, bins=30, data=bb, geom="polygon")

# DPC plot
library(ADPclust)
dpc.popul<-adpclust(bb, centroids="user")

ggplot(bb, aes(x=X, y=Y)) + geom_point(shape=20, size=3, aes(color=dpc.popul$clusters)) + scale_color_viridis_c(name="DPC clusters") + labs(title="DPC clusters for population", x="Longitude", y="Latitude")

# k-means for coordinates
kmeans.popul<-kmeans(bb, centers=4)

ggplot(bb, aes(x=X, y=Y)) + geom_point(shape=20, size=3, aes(color=kmeans.popul$cluster)) + scale_color_viridis_c(name="K-means") + labs(title="K-means clusters of coordinates for population", x="Longitude", y="Latitude")

Figure 5: Results of the alternative methods. (A) DBSCAN. (B) KDE

Rand Index for DBSCAN to assess sensitivity to parameter changes

This part of code produces the matrix of Rand Index for DBSCAN partitionings with different parameters.

# set of parameters
param<-data.frame(ID=1:6, param.minPts=c(10, 100, 250, 10, 100, 250), param.eps=c(0.01, 0.01, 0.01, 0.1, 0.1, 0.1))

# DBSCAN for each parameter
for(i in 1:6){  # iteration by rows
db<-dbscan(bb[, 1:2], eps=param[i,3], minPts=param[i,2])
assign(paste0("db_",i), db)}

# illustration of DBSCAN outcomes
par(mfrow=c(2,3))
plot(bb[, 1:2], col=db_1$cluster+1, pch=20)
plot(bb[, 1:2], col=db_2$cluster+1, pch=20)
plot(bb[, 1:2], col=db_3$cluster+1, pch=20)
plot(bb[, 1:2], col=db_4$cluster+1, pch=20)
plot(bb[, 1:2], col=db_5$cluster+1, pch=20)
plot(bb[, 1:2], col=db_6$cluster+1, pch=20)
par(mfrow=c(1,1))

# matrix of Rand Index (RI)

library(clv)

# place to put the output
RI<-matrix(0, nrow=6, ncol=6)

# Rand Index for pairs of k-means clustering vectors (of KDE values)
for(i in 1:6){ # 1:6, start of the first loop
for(j in 1:6){ # 1:6, start of the second loop
A<-get(paste0("db_", i))   # DBSCAN for i object
B<-get(paste0("db_", j))   # DBSCAN for j object
external.ind<-std.ext(A$cluster+1, B$cluster+1) # object for clv::
RI[i,j]<-clv.Rand(external.ind)       # Rand Index 
}}    
RI

QDC for empirical data - population and business location

Visualisation of data

Let’s start by visualising the data.

# making smaller subsample - selecting radnom rows
nn<-50000                        # size of subsample
selector<-sample(1:dim(firms)[1], nn, replace=FALSE) # selected rows
sub.firms<-firms[selector, ]        # subsample of firms in df

# plot of firms
ggplot(sub.firms, aes(x=lon, y=lat)) +  geom_point(size=0.5)+ theme(legend.position="none") + labs(title="Location of firms (sample of 50,000 points)", x="Longitude", y="Latitude") 

# plotr of population
ggplot(popul, aes(x=x, y=y)) +  geom_point(size=0.5)+ theme(legend.position="none") + labs(title="Location of population (65,660 points)", x="Longitude", y="Latitude")

Figure 6: Point data for density analysis. (A) Firms. (B) Population.

QDC for data

QDC for firms

# QDC for firms
firms.qdc<-QDC(firms.sf, 50000, nclust=3, k=10, eps=0.05)
head(firms.qdc)
colnames(firms.qdc)[7]<-"Cluster_ID"

# plots of results
ggplot(firms.qdc, aes(x=knndist1.scaled, y=frnn1.scaled, color=Cluster_ID)) + geom_point() + labs(title="Firms: Scatterplot of normalized spatial variables", x="Total_distance_to_k_nearest_neighbours", y="Number_of_points_around") + annotate("text", x=1, y=2, label="high density", col="red") + annotate("text", x=4, y=-0.5, label="low density", col="blue") + annotate("text", x=1, y=0, label="medium density", col="darkgreen")

ggplot(firms.qdc, aes(x=X, y=Y, color=Cluster_ID)) + geom_point() + labs(title="Firms: Location of points by clusters", x="Longitude", y="Latitude")

and QDC for population

# QDC for population
popul.qdc<-QDC(popul.sf, 50000, nclust=3, k=10, eps=0.05)
head(popul.qdc)
colnames(popul.qdc)[7]<-"Cluster_ID"
head(popul.qdc)

# plots of results
ggplot(popul.qdc, aes(x=knndist1.scaled, y=frnn1.scaled, color=Cluster_ID)) + geom_point() + labs(title="Population: Scatterplot of normalized spatial variables", x="Total_distance_to_k_nearest_neighbours", y="Number_of_points_around") + annotate("text", x=1, y=2, label="high density", col="red") + annotate("text", x=4, y=-0.5, label="low density", col="blue") + annotate("text", x=1, y=0, label="medium density", col="darkgreen")

ggplot(popul.qdc, aes(x=X, y=Y, color=Cluster_ID)) + geom_point() + labs(title="Population: Location of points by clusters", x="Longitude", y="Latitude")

Figure 7. Quick Density Clustering algorithm for firms and population

Rasterisng the QDC results

The geolocated points in Fig.7b and Fig.7d are in the same area but not in the same point locations. Therefore, a comparison of the two clusterings can only be made with a common spatial reference, which in this case is a raster. This is a division of the bounding box (a rectangle built over the extreme points in all directions) into cells. This setting assumes 50 cells per row and 50 cells per column, for a total of 2500 cells over the area and its edges. Within each grid cell, the mode of the cluster ID was calculated.

Fig.8 illustrates this analytical step. Here the differences in the classification of territories into density groups are much more visible. Businesses are more often classified as medium density than the population.

# prepare for rasterizing
bbox<-st_bbox(region)       # bounding box over the contour shape
bbox                    # extreme x and y coordinates

mode <- function(v) {  # mode function to find dominant values
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# building the raster 50 x 50
cells<-50           # number of cells in rows and columns
r<-rast(nrows=cells, ncols=cells, xmin=bbox[1], ymin=bbox[2], xmax=bbox[3], ymax=bbox[4]) # general raster, cells only 
crds.col<-1:2       # in which columns lon and lat are stored

# raster for firms
rast.firms<-rasterize(as.matrix(firms.qdc[, crds.col]), r, value=firms.qdc$outcome.set1,  fun=mode) 

plot(rast.firms, main="Firms density", mar=c(1,1,2,1), bty="n", legend=FALSE, axes=FALSE) 
plot(st_geometry(region), add=TRUE)

# raster for population
rast.popul<-rasterize(as.matrix(popul.qdc[, crds.col]), r, value=popul.qdc$outcome.set1,  fun=mode) 

plot(rast.popul, main="Population density", mar=c(1,1,2,1), bty="n", legend=FALSE, axes=FALSE) 
plot(st_geometry(region), add=TRUE)

Figure 8: Rasterised density clustering. (A) Firms. (B) Population.

Comparison of both rasters

# preparing dataset for comparisons
vec.firms<-as.vector(rast.firms$values)
vec.popul<-as.vector(rast.popul$values)

rast.diff<-NA
rast.diff[vec.firms==vec.popul]<-"the same density class"

rast.diff[vec.firms==0 & vec.popul==1]<-"firms_high, popul_low"
rast.diff[vec.firms==0 & vec.popul==2]<-"firms_high, popul_mid"

rast.diff[vec.firms==1 & vec.popul==0]<-"firms_low, popul_high"
rast.diff[vec.firms==1 & vec.popul==2]<-"firms_low, popul_mid"

rast.diff[vec.firms==2 & vec.popul==0]<-"firms_mid, popul_high"
rast.diff[vec.firms==2 & vec.popul==1]<-"firms_mid, popul_low"

rast.differ<-rast.popul # copy-paste raster structure
values(rast.differ)<-rast.diff  # overwriting the structure with new data
plot(rast.differ, main="Comparison of densities for population and firms location", axes=FALSE)         
plot(st_geometry(region), add=TRUE)

Figure 9a: Comparison of rasterised density classes for population and business location

for better comparisons - getting road and urban structure from OpenStreetMap (OSM)

# bouding box in matrix format for the NTS region
# equivalent of getbb("Warszawa") command from OSM::
bbox<-st_bbox(region)       # bounding box over the contour shape
bbox
bb.m<-matrix(0, nrow=2, ncol=2)
colnames(bb.m)<-c("min", "max")
rownames(bb.m)<-c("x", "y")
bb.m[1,1]<-bb[1]
bb.m[1,2]<-bb[3]
bb.m[2,1]<-bb[2]
bb.m[2,2]<-bb[4]
bb.m

#       min      max
#x 19.25921 23.12841
#y 51.01311 53.48181

# downloading and saving data on roads
bb.m %>% opq() %>% add_osm_feature(key="highway", value="motorway") %>% osmdata_sf()-> road1

bb.m %>% opq() %>% add_osm_feature(key="highway", value="trunk") %>% osmdata_sf()-> road2

bb.m %>% opq() %>% add_osm_feature(key="highway", value="primary") %>% osmdata_sf()-> road3

bb.m %>% opq() %>% add_osm_feature(key="highway", value="secondary") %>% osmdata_sf()-> road4

st_write(st_geometry(road1$osm_lines), "road1.shp")
st_write(st_geometry(road2$osm_lines), "road2.shp")
st_write(st_geometry(road3$osm_lines), "road3.shp")
st_write(st_geometry(road4$osm_lines), "road4.shp")

# reading road data from shapefiles (previously saved) 
road1<-st_read("road1.shp")
road2<-st_read("road2.shp")
road3<-st_read("road3.shp")
road4<-st_read("road4.shp")

# plotting raster as previously, but with roads and cities overlay
plot(rast.differ, main="Difference in firms and population density", mar=c(3,3,3,3), axes=FALSE) # bottom, left, top and right 
plot(st_geometry(region), add=TRUE)

plot(st_geometry(road1), add=TRUE, col="steelblue1", lwd=4) 
plot(st_geometry(road2), add=TRUE, col="steelblue1", lwd=3) 
plot(st_geometry(road3), add=TRUE, col="steelblue1", lwd=2) 
plot(st_geometry(road4), add=TRUE, col="steelblue1", lwd=1) 

degAxis(1)  # geographic axis, longitude
degAxis(2)  # geographic axis, latitude

points(miasta[,6:7], pch=21, bg="black", cex=(((miasta$population)^0.5)^0.5)/10)

legend(23.3, 52, legend=c("small cities", "mid-size cities", "big cities"), pt.cex=c(1, 1.5, 2.5), bty="n", pch=21, cex=0.7)
legend(23.25, 52.75, legend=c("highways", "primary roads", "secondary roads"), lwd=c(3, 2, 1), col=c("steelblue1", "steelblue1", "steelblue1"),  bty="n", cex=0.7)

Figure 9: Comparison of rasterised density classes for population and business location with road and settlement network

QDC in smaller scale - for city of Warsaw

# selecting observations located within 25 km from a city center
firms.sel<-firms[firms$dist_core_25==1, 9:10]
nn<-50000
selected<-sample(1:dim(firms.sel)[1], size=nn, replace=FALSE)
firms.sel<-firms.sel[selected, ]
colnames(firms.sel)<-c("x", "y")

center<-c(21.016, 52.254)   # city center, reference point POI
center.df<-data.frame(x=21.016, y=52.254) # the same as data.frame

# plot the observations
plot(firms.sel, pch=21, bg="black", cex=0.3)
plot(st_geometry(Warszawa), add=TRUE, lwd=3, border="green2")
points(center.df, bg="red", cex=2, pch=21)
title(main="Location of firms in Warsaw
within a radius of 25 km from city centre", sub="50,000 observations (10% subsample)")

# QDC for a subsample
qdc.waw.firms<-qdc(firms.sel, 50000, nclust=3, k=50, eps=0.05)
qdc.waw.firms

colnames(qdc.waw.firms)<-c("X", "Y", "knndist1", "frnn1", "Total_distance_to_k_nearest_neighbours", "Number_of_points_around", "Cluster_ID", "outcome.set1")

ggplot(qdc.waw.firms, aes(x=Total_distance_to_k_nearest_neighbours, y=Number_of_points_around, color=Cluster_ID)) + geom_point() + annotate("text", x=5, y=2, label="high density
(x) (low x) – close to k nearest neighbours, 
(y) (high y) - many neighbours around", col="blue") + annotate("text", x=5, y=-0.6, label="low density
(x) (high x) – far to k nearest neighbours, 
(y) (low y) – only a few neighbours around", col="red") + annotate("text", x=4, y=0, label="medium density", col="darkgreen") 

ggplot(qdc.waw.firms, aes(x=X, y=Y, color=Cluster_ID)) + geom_point() + theme(legend.position="none")

Figure 10: QDC for city of Warsaw, Poland

QDC

Katarzyna Kopczewska

2025-04-03