Abstract

This technical report explores the viability of density-based spatial clustering, specifically Density-Based Spatial Clustering of Applications with Noise (DBSCAN), as a practical method for objectively delineating real estate market areas within assessor workflows in St. Tammany Parish, Louisiana. A market area is the geographic region from which demand originates and where competing properties are located (IAAO, 2013). Traditionally, market areas are defined using assessor judgement or fixed administrative boundaries; for convenience, a subdivision is often used, which can miss true spatial market dynamics. This study addresses the research question: How viable is DBSCAN as a practical method for objectively delineating real estate market areas to improve assessor workflows within a Computer-Assisted Mass Appraisal (CAMA) system?

DBSCAN was implemented in the R statistical environment using 2020-2024 parcel real estate transaction data, which included sale prices, geographic coordinates, and relevant property characteristics. Parameters such as epsilon (ε) and minimum points (MinPts) were determined using assessor domain knowledge combined with exploratory analysis via k-distance plots. Preliminary analysis demonstrated DBSCAN effectively identified 85 clusters aligning closely with local market behaviors, capturing meaningful spatial variations and highlighting potential inaccuracies in traditional boundaries. Results also indicated DBSCAN’s sensitivity to parameter selection, evidenced by an average silhouette score of -0.02, underscoring the need for careful tuning to balance meaningful clusters and noise reduction.

This report demonstrates that DBSCAN offers assessors a viable, data-driven method for refining market area delineations, potentially increasing valuation accuracy and consistency within existing CAMA workflows.

1. Introduction

Accurate delineation of real estate market areas, the geographic regions from which demand originates and where competing properties are located, is crucial for ensuring fairness, consistency, and accuracy in property valuation processes (IAAO, 2013). Current practices employed by assessors frequently rely on subjective judgement or utilize static administrative boundaries, such as subdivision lines, city limits, or zip codes, to define market areas. These methods can unintentionally overlook the true spatial dynamics of property markets, leading to potential inaccuracies in valuation.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) provides an alternative approach for objectively delineating these market areas. DBSCAN dynamically identifies clusters of properties based on actual spatial patterns of sales activity, adjusting cluster boundaries according to the density and distribution of property transactions within a given area. Unlike traditional clustering methods such as k-means, DBSCAN does not require pre-specification of the number of clusters and is robust in identifying irregularly shaped clusters while effectively distinguishing noise or outliers from market groupings (Schubert et al., 2017).

This technical report investigates the viability and practical applicability of DBSCAN as an objective, data-driven method for defining real estate market areas within assessor workflows, specifically within Computer-Assisted Mass Appraisal (CAMA) systems. Using detailed parcel-level real estate transaction data from 2020-2024 (which encompasses the last reassessment cycle) in St. Tammany Parish, Louisiana, this study demonstrates DBSCAN’s capability to identify meaningful clusters that may reflect real world market dynamics more accurately than traditional methods. By clearly establishing the practical viability of DBSCAN, this report aims to provide assessors with a structured analytical approach capable of enhancing property valuation accuracy, consistency, and fairness.

The primary research question guiding this technical report is: How viable is DBSCAN as a practical method for objectively delineating real estate market areas to improve assessor workflows within a CAMA system?

2. Understanding DBSCAN

2.1 What is DBSCAN?

To evaluate the practical viability of DBSCAN for real estate market delineation, it is important to first understand the underlying concepts and mechanics of this clustering method. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an unsupervised clustering algorithm that groups together points in dense regions while marking outliers as noise. Unlike partition-based methods such as k-means, DBSCAN does not require the number of clusters to be pre-specified and is effective at identifying clusters of arbitrary shape (Ester, Kriegel, Sander, & Xu, 1996).

2.2 Mathematical and Computational Process of DBSCAN

DBSCAN identifies clusters based on density by classifying points into core points, border points, or noise using two parameters: ε (epsilon), which defines the radius of a neighborhood, and MinPts, the minimum number of points required to form a dense region. A point is a core point if at least MinPts points (including itself) exist within its ε-neighborhood. Border points fall within ε of a core point but do not have enough neighbors to be core points themselves, while noise points do not belong to any cluster (Ester, Kriegel, Sander, & Xu, 1996).
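
Stated more formally, the three point types can be written as follows. This is a notational sketch following the definitions in Ester et al. (1996); the symbols D (the set of points), d (the distance function), and N_ε (the ε-neighborhood) are introduced here for illustration.

```latex
\begin{aligned}
N_\varepsilon(p) &= \{\, q \in D : d(p, q) \le \varepsilon \,\} \\
p \text{ is a core point} &\iff |N_\varepsilon(p)| \ge \mathrm{MinPts} \\
q \text{ is a border point} &\iff |N_\varepsilon(q)| < \mathrm{MinPts}
  \ \text{and}\ q \in N_\varepsilon(p) \text{ for some core point } p \\
q \text{ is noise} &\iff q \text{ is neither a core point nor a border point}
\end{aligned}
```

Because p belongs to its own neighborhood, the count on the left-hand side includes the point itself, which matches the definition given above.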

The algorithm begins by selecting an arbitrary point. If it is a core point, a new cluster is formed and all points density-reachable from it are added: every point in its ε-neighborhood joins the cluster, and any of those that are themselves core points extend the search to their own neighborhoods. Border points are included in the cluster but do not expand it. If a point is not reachable from any core point, it remains noise. This process continues until all points are classified. Computationally, DBSCAN typically runs in O(n log n) time with efficient spatial indexing, but can degrade to O(n²) in worst cases. Its ability to detect arbitrarily shaped clusters and ignore noise makes it particularly effective for spatial clustering applications (Ester et al., 1996; Schubert, Sander, Ester, Kriegel, & Xu, 2017).
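
The expansion procedure described above can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used in this study (which relies on the dbscan package): it computes a full distance matrix, so it is only suitable for small examples, and the function name `simple_dbscan` and the toy coordinates are hypothetical.

```r
# Minimal DBSCAN sketch in base R (illustration only; O(n^2) time and memory).
# Labels: 0 = noise, 1..K = cluster ids. All names here are hypothetical.
simple_dbscan <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))                      # pairwise Euclidean distances
  n <- nrow(d)
  # eps-neighborhood of each point (includes the point itself)
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbors) >= min_pts
  labels <- integer(n)                         # 0 = unassigned / noise
  cluster_id <- 0L
  for (i in seq_len(n)) {
    if (labels[i] != 0L || !is_core[i]) next   # start clusters only at unvisited core points
    cluster_id <- cluster_id + 1L
    labels[i] <- cluster_id
    queue <- neighbors[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (labels[j] != 0L) next                # already assigned to a cluster
      labels[j] <- cluster_id                  # core or border point joins the cluster
      if (is_core[j]) queue <- c(queue, neighbors[[j]])  # only core points expand
    }
  }
  list(cluster = labels, is_core = is_core)
}

# Toy data: two dense groups of five points and one isolated point
toy <- rbind(
  cbind(c(0, 0, 1, 1, 0.5), c(0, 1, 0, 1, 0.5)),
  cbind(c(10, 10, 11, 11, 10.5), c(10, 11, 10, 11, 10.5)),
  c(5, 5)
)
res <- simple_dbscan(toy, eps = 1.5, min_pts = 4)
print(res$cluster)
```

In this toy example the two groups receive labels 1 and 2 and the isolated point keeps label 0 (noise). A border point that is within ε of core points from two clusters is assigned to whichever cluster reaches it first in this sketch.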

3. History of DBSCAN

With DBSCAN’s foundational concepts established, exploring its historical development provides valuable context on its methodological evolution and highlights why it remains relevant today. DBSCAN was first introduced in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu as a method for discovering spatial clusters in large databases (Ester et al., 1996). It was developed as a response to the limitations of existing clustering methods, such as k-means, which struggle with non-spherical clusters and noise.

Over time, DBSCAN has gained widespread recognition due to its ability to handle noise. Schubert et al. (2017) revisited the algorithm, emphasizing its enduring relevance and addressing common misconceptions. They highlighted its practical advantages, including its deterministic nature when using a fixed distance metric and its efficiency in handling large spatial datasets.

One of DBSCAN’s major strengths is its adaptability to GIS applications. Since its inception, researchers have applied it to various domains, including real estate market delineation (Farhan & Murray, 2005), urban development (Han, 2005), and retail clustering (Orr, Stewart, Jackson, & White, 2022). Recent studies have explored its integration with spatial networks to refine clustering in urban environments (Yoshimura, Santi, Arias, Zheng, & Ratti, 2021).

4. Variants of DBSCAN

While DBSCAN remains highly effective in spatial clustering applications, several variants have emerged to overcome its limitations or to handle specific analytical challenges such as parameter sensitivity, variable density distributions, or high-dimensional data. Briefly reviewing these variants highlights DBSCAN’s flexibility and provides context for why the original DBSCAN method was selected for practical evaluation in this study. Among these variants, Ordering Points To Identify the Clustering Structure (OPTICS) represents one of the most significant extensions.

4.1 OPTICS (Ordering Points To Identify the Clustering Structure)

OPTICS, introduced by Ankerst, Breunig, Kriegel, and Sander (1999), is an extension of DBSCAN that overcomes the sensitivity of DBSCAN to the ε parameter. Unlike DBSCAN, which requires a fixed ε value to define density-based neighborhoods, OPTICS generates a reachability plot, allowing clusters of varying densities to be extracted automatically. This adaptability makes OPTICS more effective for datasets with non-uniform density distributions. However, it is computationally more expensive than DBSCAN, with a worst-case complexity of O(n²), though indexing structures can improve performance (Ankerst et al., 1999).
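
The quantities plotted in a reachability plot can be written down compactly. The following is a notational sketch of the core distance and reachability distance as defined by Ankerst et al. (1999); the symbols N_ε and d are notation assumed here, with N_ε(o) the ε-neighborhood of o and d the distance function.

```latex
\begin{aligned}
\text{core-dist}_{\varepsilon,\mathrm{MinPts}}(o) &=
\begin{cases}
\text{undefined} & \text{if } |N_\varepsilon(o)| < \mathrm{MinPts} \\
\text{distance from } o \text{ to its MinPts-th nearest neighbor} & \text{otherwise}
\end{cases} \\[4pt]
\text{reach-dist}_{\varepsilon,\mathrm{MinPts}}(p, o) &=
\begin{cases}
\text{undefined} & \text{if } |N_\varepsilon(o)| < \mathrm{MinPts} \\
\max\bigl(\text{core-dist}_{\varepsilon,\mathrm{MinPts}}(o),\ d(o, p)\bigr) & \text{otherwise}
\end{cases}
\end{aligned}
```

Valleys in the ordered sequence of reachability distances correspond to dense regions, which is why clusters of different densities appear as valleys of different depths.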

4.2 HDBSCAN (Hierarchical DBSCAN)

HDBSCAN (Campello, Moulavi, & Sander, 2013) is another major refinement that eliminates the need to manually set ε. Instead, it builds a hierarchical tree of density-based clusters and extracts stable clusters from this hierarchy. This approach allows for better handling of variable-density regions and determines the number of clusters from the data, making it well-suited for large-scale spatial analysis. Unlike DBSCAN, which assigns only a hard label to each point, HDBSCAN can additionally report a membership probability for each point, giving a softer view of cluster assignment.
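
As a small illustration of the membership output, the sketch below runs `hdbscan()` from the dbscan R package on synthetic data. The data are invented for this example and are not parcel data; the `minPts` value is an arbitrary choice for the toy set.

```r
library(dbscan)

set.seed(42)
# Two synthetic groups with different spreads (toy data, not parcel data)
toy <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 5, sd = 1.0), ncol = 2)
)

hdb <- hdbscan(toy, minPts = 10)

# Hard labels (0 = noise) alongside the per-point membership strength
head(data.frame(cluster = hdb$cluster,
                membership_prob = round(hdb$membership_prob, 2)))
table(hdb$cluster)
```

Points deep inside a dense region tend to receive membership values near 1, while points on the fringe of a cluster receive lower values, which could help flag parcels whose market area assignment deserves a second look.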

4.3 Other Notable Variants

Several other adaptations of DBSCAN have been introduced to address specific challenges:

IDBSCAN (Incremental DBSCAN): Designed for dynamic datasets, allowing efficient updates when new data points are added (Ester, Kriegel, & Xu, 1998).

DENCLUE (Density-Based Clustering Based on Density Distribution Functions): A probabilistic extension that estimates density functions rather than relying on fixed ε neighborhoods (Hinneburg & Keim, 1998).

ST-DBSCAN (Spatial-Temporal DBSCAN): Extends DBSCAN for spatiotemporal clustering, useful for tracking moving objects or evolving spatial patterns (Birant & Kut, 2007).

4.4 Relevance of DBSCAN Variants to Real Estate Market Analysis

In the context of real estate market delineation, OPTICS and HDBSCAN offer advantages over traditional DBSCAN when dealing with variable-density urban areas. These methods help assess market heterogeneity, where clusters may not have uniform density due to differences in property value, zoning regulations, or economic activity. For example, HDBSCAN’s hierarchical clustering approach could provide a more nuanced classification of market areas, while OPTICS can reveal gradual transitions between clusters rather than imposing hard boundaries.

5. Literature Review

Having explored DBSCAN’s conceptual and historical contexts, this literature review examines practical applications of density-based clustering relevant to real estate and geographic market delineation. Density-based spatial clustering methods, particularly DBSCAN, have been increasingly applied to geographic analysis and real estate market delineation due to their practical strengths. DBSCAN is especially valued for its ability to distinguish meaningful geographic clusters from noise, identify clusters of arbitrary shape, and let cluster boundaries follow the actual spatial distribution of the data (Schubert et al., 2017).

In practical urban planning and real estate contexts, spatial clustering techniques have demonstrated clear advantages over traditional methods. For example, GIS-based density clustering was used effectively to delineate market areas for park-and-ride facilities, resulting in more accurate and responsive market definitions compared to traditional methods reliant on administrative boundaries (Farhan & Murray, 2005). Similarly, density based clustering techniques were applied successfully to analyze urban developments characterized by multiple activity centers, providing improved insights into spatial patterns of property valuation in comparison to conventional static methods (Han, 2005).

Recent practical applications also include analyzing how urban street networks influence retail sales patterns using density-based spatial clustering. Such studies have shown the value of DBSCAN and similar methods in capturing nuanced patterns of economic activity that are often overlooked by traditional analysis relying solely on fixed boundary definitions (Yoshimura et al., 2021).

While previous studies have successfully demonstrated DBSCAN’s strengths in urban analysis, limited attention has been explicitly given to integrating DBSCAN within assessor workflows and Computer-Assisted Mass Appraisal (CAMA) systems. This study directly addresses this practical gap, emphasizing DBSCAN’s real-world applicability for assessors. In doing so, this report provides a practical contribution specifically tailored to assessors, demonstrating DBSCAN’s potential value for improving the accuracy and consistency of market area delineation within current valuation practices.

6. Data

This study utilizes two primary datasets provided by the St. Tammany Parish Assessor’s Office, covering detailed property records and real estate transactions from 2020 to 2024. These datasets provide comprehensive parcel-level detail necessary for spatial clustering and market delineation.

The first dataset is the Parcel Spatial Data, provided in shapefile format, which contains precise geographic boundaries and parcel centroid coordinates. This dataset serves as the foundational spatial reference for the analysis, facilitating accurate geographic representation and spatial clustering.

The second primary dataset is the Real Estate Transaction Data, extracted from the assessor’s SQL database. It includes detailed records of property transactions between 2020 and 2024, encompassing sale price, transaction date, year built, total square footage, and depreciation year. The most recent reassessment took place in 2024; the Louisiana State Constitution mandates a four-year reassessment cycle. The sale price is critical for assessing property valuation trends, transaction dates provide temporal context for market analysis, and characteristics such as property age (year built and depreciation year) and square footage facilitate refinement and validation of market area clusters.

Both datasets include geographic coordinates (latitude and longitude) necessary for spatial clustering analysis, enabling DBSCAN to objectively identify meaningful market clusters based on spatial proximity and property characteristics.

However, some limitations affected the completeness of the transaction data at the time of analysis. Specifically, certain transaction details such as precise sale dates and historical transaction records were incomplete or temporarily inaccessible due to technical constraints with the assessor’s database. Additionally, assessor staff expressed differing professional opinions regarding appropriate DBSCAN parameter choices (ε and MinPts), reflecting uncertainty about the optimal parameter values. These limitations are important considerations for the interpretability and accuracy of the preliminary clustering results presented in this report.

To clearly summarize, Table 1 below outlines each dataset, their key attributes, and the primary role they play in the analysis:

Table 1: Summary of Primary Datasets and Attributes

| Dataset | Attributes | Role in Analysis |
|---|---|---|
| Parcel Spatial Data (Shapefile) | Parcel boundaries, centroid coordinates (latitude/longitude) | Spatial reference; DBSCAN clustering basis |
| Real Estate Transaction Data | Sale price, transaction date, year built, square footage, depreciation year, geographic coordinates | Market valuation analysis; attributes for validating and refining spatial clusters |

7. Methodology

7.1 Overview of DBSCAN

DBSCAN identifies core, border, and noise points based on neighborhood density. This method is particularly effective for spatial clustering applications because it does not require a predefined number of clusters and can clearly distinguish densely clustered spatial areas from noise (Schubert et al., 2017). Two critical parameters define DBSCAN clusters: epsilon (ε), the maximum radius of the neighborhood around a point, and minimum points (MinPts), the minimum number of points required within that radius for a point to qualify as a core point.

7.2 Methodological Steps

The methodological workflow consisted of three main stages: Data Preprocessing, Clustering Implementation, and Evaluation and Validation. A visual summary of this workflow is shown in Figure 1 below.

Figure 1: Methodological Workflow for DBSCAN Analysis

7.2.1 Data Preprocessing

The initial data preprocessing stage involved several key steps, detailed in the Python preprocessing script provided in Appendix A. Numerical attributes such as sale price and square footage were normalized to facilitate comparison across varying scales. Parcel centroids were generated using ArcPy’s FeatureToPoint method, providing spatial reference points for the clustering analysis. Duplicate records and parcels with null geometries were removed to maintain data quality. The final processed dataset included populated coordinate fields (latitude and longitude) and was exported to CSV and shapefile formats for the subsequent clustering analysis conducted in R.

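library(sf)         # spatial data handling
library(dbscan)     # DBSCAN, OPTICS, HDBSCAN
library(ggplot2)    # plotting
library(dplyr)      # data manipulation
library(cluster)    # silhouette scores
library(RANN)       # fast nearest-neighbor search (nn2)
library(patchwork)  # combining ggplot figures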

# Load parcel centroids
data_path <- "M:/SCHOOL/PennState/GEO_588_2025/TERM_PROJECT/DATA/PARCELS_2024_Centroid.shp"
parcels_centroids <- st_read(data_path)
## Reading layer `PARCELS_2024_Centroid' from data source 
##   `M:\SCHOOL\PennState\GEO_588_2025\TERM_PROJECT\DATA\PARCELS_2024_Centroid.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 131996 features and 9 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 3620122 ymin: 609695.6 xmax: 3836715 ymax: 805091.4
## Projected CRS: NAD83 / Louisiana South (ftUS)
# Check coordinate reference system
print(st_crs(parcels_centroids))
## Coordinate Reference System:
##   User input: NAD83 / Louisiana South (ftUS) 
##   wkt:
## PROJCRS["NAD83 / Louisiana South (ftUS)",
##     BASEGEOGCRS["NAD83",
##         DATUM["North American Datum 1983",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4269]],
##     CONVERSION["SPCS83 Louisiana South zone (US survey foot)",
##         METHOD["Lambert Conic Conformal (2SP)",
##             ID["EPSG",9802]],
##         PARAMETER["Latitude of false origin",28.5,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8821]],
##         PARAMETER["Longitude of false origin",-91.3333333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8822]],
##         PARAMETER["Latitude of 1st standard parallel",30.7,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8823]],
##         PARAMETER["Latitude of 2nd standard parallel",29.3,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8824]],
##         PARAMETER["Easting at false origin",3280833.3333,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8826]],
##         PARAMETER["Northing at false origin",0,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8827]]],
##     CS[Cartesian,2],
##         AXIS["easting (X)",east,
##             ORDER[1],
##             LENGTHUNIT["US survey foot",0.304800609601219]],
##         AXIS["northing (Y)",north,
##             ORDER[2],
##             LENGTHUNIT["US survey foot",0.304800609601219]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["United States (USA) - Louisiana - counties of Acadia; Allen; Ascension; Assumption; Beauregard; Calcasieu; Cameron; East Baton Rouge; East Feliciana; Evangeline; Iberia; Iberville; Jefferson; Jefferson Davis; Lafayette; LaFourche; Livingston; Orleans; Plaquemines; Pointe Coupee; St Bernard; St Charles; St Helena; St James; St John the Baptist; St Landry; St Martin; St Mary; St Tammany; Tangipahoa; Terrebonne; Vermilion; Washington; West Baton Rouge; West Feliciana."],
##         BBOX[28.85,-93.94,31.07,-88.75]],
##     ID["EPSG",3452]]
# Extract coordinates for later visualization
coords <- st_coordinates(parcels_centroids)
coords_df <- as.data.frame(coords)

# Load sales data
sales_data <- read.csv("M:/SCHOOL/PennState/GEO_588_2025/TERM_PROJECT/DATA/real_estate_sales.csv")

# Ensure REID is character in both datasets
sales_data$REID <- as.character(sales_data$REID)
parcels_centroids$REID <- as.character(parcels_centroids$REID)

# Join datasets with the sf object first to preserve geometry
merged_data <- inner_join(parcels_centroids, sales_data, by = "REID")

# Normalize Revenue (sale price)
merged_data$REVENUE_NORM <- as.numeric(scale(merged_data$REVENUE))

# Filter to complete cases for clustering (ensure coordinate fields and footprint area exist)
merged_data_complete <- merged_data %>% 
  filter(complete.cases(POINT_X, POINT_Y, MB_FOOTPRINT_AREA, REVENUE_NORM))
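
The script above standardizes only the sale price (REVENUE_NORM). Because DBSCAN measures Euclidean distance across all clustering columns at once, columns recorded on large numeric scales can dominate the neighborhood calculation. The sketch below illustrates the effect and one possible remedy, standardizing every column with `scale()`. The column names mirror the script, but the values are invented, and whether equal weighting of all four columns suits the parish data is an open assumption rather than a finding of this study.

```r
# Toy rows with invented values; column names mirror the script above.
toy <- data.frame(
  POINT_X           = c(3700000, 3700500, 3700050),
  POINT_Y           = c(700000, 700400, 700020),
  MB_FOOTPRINT_AREA = c(1500, 1500, 4000),
  REVENUE_NORM      = c(-0.5, -0.5, 2.0)
)

# Raw distances from row 1: row 2 is several hundred units away with identical
# attributes; row 3 is nearly next door but differs in footprint and price.
raw_dist <- as.matrix(dist(toy))
round(raw_dist[1, ], 1)

# Dropping the z-scored price barely changes the raw distance, which shows how
# little it contributes next to the unscaled columns.
raw_no_price <- as.matrix(dist(toy[, c("POINT_X", "POINT_Y", "MB_FOOTPRINT_AREA")]))
raw_dist[1, 3] - raw_no_price[1, 3]

# Standardizing every column puts all four dimensions on a comparable scale.
toy_scaled <- as.data.frame(scale(toy))
scaled_dist <- as.matrix(dist(toy_scaled))
round(scaled_dist[1, ], 2)
```

If all columns are standardized, ε is no longer expressed in map units, so the k-distance plot would need to be regenerated on the scaled data before choosing a value.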

7.2.2 Clustering Implementation

Because the analysis uses four clustering dimensions (the two centroid coordinates POINT_X and POINT_Y, property footprint area, and normalized revenue), we followed the recommendation discussed by Schubert et al. (2017) to set MinPts to twice the number of dimensions (MinPts = 8). This heuristic is widely used in the spatial clustering literature because it helps keep the identified clusters stable and less sensitive to isolated points or minor data variations.

# Select and convert clustering attributes to numeric
clustering_data <- merged_data_complete %>% 
  st_drop_geometry() %>% 
  select(POINT_X, POINT_Y, MB_FOOTPRINT_AREA, REVENUE_NORM) %>%
  mutate(across(everything(), as.numeric))

# Set MinPts (k) as twice the number of dimensions
k <- 2 * ncol(clustering_data)
cat("Selected MinPts (k):", k, "\n")
## Selected MinPts (k): 8
# Compute and sort k-nearest neighbor distances.
# nn2() queries the data against itself, so column 1 is each point's zero
# distance to itself and column k is the distance to the k-th point counting
# the point itself, consistent with MinPts including the point itself.
distances <- nn2(clustering_data, k = k)$nn.dists[, k]
sorted_distances <- sort(distances, decreasing = TRUE)

# Generate k-distance plot for epsilon selection
ggplot(data.frame(Index = seq_along(sorted_distances), Distance = sorted_distances), 
       aes(x = Index, y = Distance)) +
  geom_line(color = "blue") +
  geom_point(color = "red", size = 0.5) +
  labs(title = "Figure 2: k-Distance Plot", 
       x = "Points Sorted by Distance", 
       y = paste0(k, "th Nearest Neighbor Distance")) +
  theme_minimal()

# Suggest epsilon values based on quantiles
eps_85 <- quantile(sorted_distances, 0.85)
eps_90 <- quantile(sorted_distances, 0.90)

cat("Suggested ε (85th percentile):", round(eps_85, 2), "\n")
## Suggested ε (85th percentile): 1604.69
cat("Suggested ε (90th percentile):", round(eps_90, 2), "\n")
## Suggested ε (90th percentile): 2067.48

Based on the k-distance plot (Figure 2), exploratory analysis suggests considering epsilon values around the 85th and 90th percentiles of the k-distance distribution. The preliminary run below uses the 85th percentile value; the final epsilon choice will be informed by further validation through assessor expertise and the preliminary clustering results.

# Apply DBSCAN with selected parameters
set.seed(123)
optimal_eps <- eps_85  
min_points <- k

# Run DBSCAN algorithm
clustering_results <- dbscan(clustering_data, eps = optimal_eps, minPts = min_points)

# Append cluster labels to the merged dataset
merged_data_complete$cluster <- as.factor(clustering_results$cluster)

# Output cluster statistics (label 0 marks noise, so it is excluded from the count)
cat("Total clusters identified:", length(setdiff(unique(clustering_results$cluster), 0)), "\n")
## Total clusters identified: 85
cat("Noise points:", sum(clustering_results$cluster == 0), "\n")
## Noise points: 1346

7.3 Evaluation and Validation

The DBSCAN clusters were evaluated using both qualitative and quantitative approaches.

Qualitative Validation: Visual assessments were made using GIS tools to compare DBSCAN clusters with assessor-defined market areas.

Quantitative Validation: Silhouette scores, calculated on a random subset of non-noise points (up to 5,000), provide a numeric measure of cluster cohesion and separation.

# Silhouette Score Calculation
set.seed(123)
non_noise_indices <- which(clustering_results$cluster != 0)
sample_size <- min(5000, length(non_noise_indices))
sample_indices <- sample(non_noise_indices, sample_size)

sample_data <- clustering_data[sample_indices, ]
sample_clusters <- clustering_results$cluster[sample_indices]

silhouette_values <- silhouette(sample_clusters, dist(sample_data))
average_silhouette <- mean(silhouette_values[, 3])

cat("Average Silhouette Score:", round(average_silhouette, 3), "\n")
## Average Silhouette Score: -0.02
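
For reference when reading this value, the silhouette width of a single point i is commonly defined as follows. Here a(i) is the mean distance from i to the other points in its own cluster and b(i) is the smallest mean distance from i to the points of any other cluster; the symbols are notation assumed for this sketch.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values near 1 indicate a point well inside its cluster, values near 0 indicate a point lying between clusters, and negative values indicate a point that is on average closer to a neighboring cluster than to its own. An average close to zero therefore suggests that many sampled points sit near cluster boundaries in the four-dimensional feature space.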

7.4 Clustering Visualization

DBSCAN cluster results were visualized to facilitate interpretation of the market area delineation.

Parcel Clusters: Figure 3 displays the spatial distribution of clusters, highlighting clear boundaries and delineations that emerged from the DBSCAN analysis.

ggplot(merged_data_complete) +
  geom_sf(aes(color = cluster), size = 0.2, show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Figure 3: DBSCAN Clustering") +
  theme(legend.position = "none")

Convex Hull of Clusters: Figure 4 provides additional clarity by presenting convex hulls around each cluster, clearly showing cluster boundaries and facilitating comparisons with assessor-defined market areas.

# Build one convex hull per cluster from its member points (noise excluded)
clusters_sf <- merged_data_complete %>%
  filter(cluster != "0") %>%
  group_by(cluster) %>%
  summarize(geometry = st_convex_hull(st_combine(geometry)))

ggplot() +
  geom_sf(data = merged_data_complete, aes(color = cluster), size = 0.2, alpha = 0.7, show.legend = FALSE) +
  geom_sf(data = clusters_sf, fill = NA, color = "black", size = 0.5) +
  theme_minimal() +
  labs(title = "Figure 4: DBSCAN Clusters with Convex Hulls") +
  theme(legend.position = "none")

Density Heatmap: The density heatmap shown in Figure 5 further supports visual interpretation by identifying high-density clusters and potential hotspots within the study area.

ggplot(coords_df, aes(x = X, y = Y)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", contour = TRUE, show.legend = FALSE) +
  scale_fill_viridis_c(guide = "none") +
  theme_minimal() +
  labs(title = "Figure 5: Density Heatmap of Parcel Centroids", x = "Easting", y = "Northing")

7.5 Sample of Variants

This section applies two popular variants of DBSCAN, OPTICS and HDBSCAN, to the same clustering data for comparison.

Figure 6: OPTICS Reachability Plot illustrating the reachability distances for the dataset, which aids in identifying suitable epsilon values for DBSCAN clustering.

# 1. OPTICS -----------------------------------------------------------------
# "optics()" computes an ordering rather than final cluster labels.

# Adjust minPts as appropriate (e.g., 10, 20, etc.).
optics_result <- optics(clustering_data, minPts = 10)

# Visualize the reachability plot.
plot(optics_result, 
     main = "Figure 6: OPTICS Reachability Plot")

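# Convert the OPTICS result into DBSCAN-style clusters by specifying eps_cl.
# eps_cl uses the same distance units as the DBSCAN epsilon, so the value
# selected from the k-distance plot is reused for a like-for-like comparison.
optics_clusters <- extractDBSCAN(optics_result, eps_cl = optimal_eps)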

# Store cluster labels in the sf data frame
merged_data_complete$cluster_optics <- as.factor(optics_clusters$cluster)

# 2. HDBSCAN ----------------------------------------------------------------
# HDBSCAN automatically identifies clusters of varying density; 
# it only needs minPts (similar to DBSCAN).

hdbscan_result <- hdbscan(clustering_data, minPts = 10)

# hdbscan_result$cluster gives the cluster labels (0 = noise/outlier).
merged_data_complete$cluster_hdbscan <- as.factor(hdbscan_result$cluster)

# 3. Visualization ----------------------------------------------------------
# Side-by-side maps to compare OPTICS and HDBSCAN clusters.

p_optics <- ggplot(merged_data_complete) +
  geom_sf(aes(color = cluster_optics), size = 0.2, show.legend = FALSE) +
  theme_minimal() +
  labs(title = "OPTICS Clusters")

p_hdbscan <- ggplot(merged_data_complete) +
  geom_sf(aes(color = cluster_hdbscan), size = 0.2, show.legend = FALSE) +
  theme_minimal() +
  labs(title = "HDBSCAN Clusters")

# Combine both plots side by side and add an overall title.
combined_plot <- p_optics + p_hdbscan +
  plot_annotation(title = "Figure 7: Side-by-Side Comparison")

combined_plot

8. Results and Challenges

Results

The implementation of DBSCAN using 2020-2024 parcel-level transaction data identified 85 distinct clusters with 1,346 points classified as noise, aligning notably well with localized market knowledge and real-world property market behaviors. Visual analyses, including GIS-based mapping, density heatmaps, and convex hull visualizations, indicated that DBSCAN captured spatial patterns that differ significantly from traditional administrative boundaries such as subdivision or municipal lines. Notably, clusters delineated by DBSCAN demonstrated irregular and realistic shapes, reflecting genuine market interactions rather than arbitrary boundaries.

Quantitative evaluation through silhouette scores, however, presented initial challenges. An average silhouette score of -0.02 indicated considerable overlap among clusters and areas where spatial coherence could be improved. This negative silhouette score underscores the sensitivity of DBSCAN to parameter choices, particularly epsilon (ε) and the minimum number of points (MinPts).

Challenges

One significant challenge encountered was the need for multiple iterations of parameter tuning to optimize the DBSCAN results. The parameters epsilon and MinPts strongly influence clustering outcomes, and inappropriate selections can either overly generalize clusters or excessively fragment them into noise points. Although domain knowledge from assessor staff guided initial parameter selection, iterative exploratory analysis via k-distance plots became crucial to refining these parameters effectively. Future implementations should anticipate multiple parameter testing iterations and consider automated or semi-automated approaches to tuning, such as grid searches, to systematically identify optimal parameter ranges.
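
A grid search of the kind suggested above could look like the following sketch. It uses `dbscan()` from the dbscan package on synthetic data; the candidate ε and MinPts values are hypothetical and chosen only to suit the toy data, and the two summary measures (cluster count and noise share) are one possible choice of screening criteria rather than an established standard.

```r
library(dbscan)

set.seed(1)
# Synthetic stand-in for a scaled clustering matrix (not parcel data)
toy <- rbind(
  matrix(rnorm(300, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(300, mean = 4, sd = 0.4), ncol = 2)
)

# Candidate parameter values (hypothetical ranges chosen for the toy data)
grid <- expand.grid(eps = c(0.2, 0.4, 0.6), min_pts = c(4, 8, 12))

# For each combination, record the number of clusters and the share of noise
grid$n_clusters <- NA_integer_
grid$noise_share <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit <- dbscan(toy, eps = grid$eps[i], minPts = grid$min_pts[i])
  grid$n_clusters[i] <- length(setdiff(unique(fit$cluster), 0))
  grid$noise_share[i] <- mean(fit$cluster == 0)
}

# Inspect combinations ordered by how much data they discard as noise
grid[order(grid$noise_share), ]
```

In practice the resulting table would be reviewed together with assessor staff, since a combination with little noise can still merge areas that appraisers regard as distinct markets.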

Data preparation also posed notable challenges. The initial dataset included parcels with non-residential usage or abnormal transaction behaviors, which can significantly distort clustering outcomes. Comprehensive data filtering and normalization, including more rigorous outlier detection and removal procedures, are recommended for future analyses. Ensuring data quality is paramount, as clustering accuracy directly depends on the integrity and representativeness of the input data.

Additionally, the practical utility of DBSCAN derived market areas in assessor workflows hinges significantly on further validation. This includes comparative analyses with existing market areas delineated through traditional methods. Conducting ratio studies comparing assessment-to-sale price ratios within DBSCAN derived clusters versus traditionally defined market areas would provide concrete evidence regarding the viability and effectiveness of DBSCAN for integration into Computer-Assisted Mass Appraisal (CAMA) systems. Without these comparative assessments, the practical validation of DBSCAN remains incomplete.
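
A ratio study of this kind can be sketched in base R as follows. The values and column names are invented for illustration. The coefficient of dispersion (COD) is included as an assumption: it is a uniformity statistic commonly reported in ratio studies, but it is not a measure computed in this report.

```r
# Invented example values; column names are hypothetical.
sales <- data.frame(
  market_area    = c("A", "A", "A", "B", "B", "B"),
  assessed_value = c(190000, 205000, 210000, 150000, 240000, 330000),
  sale_price     = c(200000, 200000, 200000, 200000, 250000, 300000)
)
sales$ratio <- sales$assessed_value / sales$sale_price

# Coefficient of dispersion: average absolute deviation from the median ratio,
# expressed as a percentage of the median ratio.
cod <- function(r) 100 * mean(abs(r - median(r))) / median(r)

# Summarize each market area separately
ratio_stats <- do.call(rbind, lapply(split(sales, sales$market_area), function(g) {
  data.frame(market_area  = g$market_area[1],
             n_sales      = nrow(g),
             median_ratio = median(g$ratio),
             cod          = cod(g$ratio))
}))
ratio_stats
```

Running the same summary once with DBSCAN cluster labels and once with existing assessor-defined market areas as the grouping column would allow the two delineations to be compared on equal terms; lower dispersion within groups would suggest more uniform assessments.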

Overall, preliminary results affirm the potential of DBSCAN as a valuable tool for objectively delineating real estate market areas, highlighting its practical relevance for valuation purposes. However, practical implementation within assessor workflows necessitates careful attention to parameter optimization, data quality management, and thorough validation through comparative market analyses and ratio studies.

9. Conclusion

This study indicates that DBSCAN is a viable method for objectively delineating real estate market areas, offering a potential improvement over traditional subjective and static boundary approaches. By identifying clusters from actual transaction patterns rather than fixed boundaries, DBSCAN can capture spatial market dynamics that those boundaries miss. These strengths depend on careful parameter optimization and rigorous data quality management, both of which are essential for accurate and reliable clustering outcomes.

Future research should focus on refining DBSCAN cluster definitions by incorporating additional relevant property characteristics and systematically optimizing parameter selections. Comprehensive validation through comparative studies, including ratio analysis between DBSCAN-derived market areas and existing assessor-defined areas, is crucial to confirm the practical applicability and benefits of DBSCAN within assessor workflows and CAMA systems. Ultimately, integrating DBSCAN into valuation practices has the potential to significantly enhance accuracy, consistency, and fairness in property assessments.

10. References

Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 49-60. https://doi.org/10.1145/304182.304187

Birant, D., & Kut, A. (2007). ST-DBSCAN: An algorithm for clustering spatial–temporal data. Data & Knowledge Engineering, 60(1), 208-221. https://doi.org/10.1016/j.datak.2006.01.013

Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. Advances in Knowledge Discovery and Data Mining, 160-172. https://doi.org/10.1007/978-3-642-37456-2_14

Ester, M., Kriegel, H. P., & Xu, X. (1998). Incremental clustering for mining in a data warehousing environment. Proceedings of the 24th International Conference on Very Large Data Bases, 323-333.

Farhan, B., & Murray, A. T. (2005). A GIS-based approach for delineating market areas for park and ride facilities. Transactions in GIS, 9, 91-108. https://doi.org/10.1111/j.1467-9671.2005.00208.x

Han, S. S. (2005). Polycentric urban development and spatial clustering of condominium property values: Singapore in the 1990s. Environment and Planning A: Economy and Space, 37(3), 463-481. https://doi.org/10.1068/a3746

Hinneburg, A., & Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 58-65.

International Association of Assessing Officers (IAAO). (2013). Standard on Mass Appraisal of Real Property. International Association of Assessing Officers.

Orr, A. M., Stewart, J. L., Jackson, C. C., & White, J. T. (2022). Shifting prime retailing pitches. A GIS analysis of the spatial adaptations in city centre retail markets. Journal of Property Research, 40(2), 101-133. https://doi.org/10.1080/09599916.2022.2141133

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems, 42(3), Article 19. https://doi.org/10.1145/3068335

Yoshimura, Y., Santi, P., Arias, J. M., Zheng, S., & Ratti, C. (2021). Spatial clustering: Influence of urban street networks on retail sales volumes. Environment and Planning B: Urban Analytics and City Science, 48(7). https://doi.org/10.1177/2399808320954210

Appendix A: Parcel Data Processing Script

This appendix provides the full Python script used for processing parcel data and preparing it for spatial clustering analysis using DBSCAN. The script performs centroid extraction, spatial transformations, duplicate removal, and data export.

# Name: PARCEL DATA PROCESSING PROJECT
# Purpose: PREPARE PARCEL DATA FOR EXPLORATORY ANALYSIS IN R STUDIO USING DBSCAN
# Folder: M:_588_2025_PROJECT
# Author: zwhite
# Created: 03/05/2025

# Import necessary Python libraries
import arcpy
import os
import time
import csv

# Define file paths for the database and output locations
gdb_path = r"M:_588_2025_PROJECT_588_PARCEL.gdb"
parcel_fc = os.path.join(gdb_path, "PARCELS_2024")
output_folder = r"M:_588_2025_PROJECT"

# Define feature classes for intermediate and final outputs
centroid_fc = os.path.join(gdb_path, "PARCELS_2024_Centroid")
cleaned_centroid_fc = os.path.join(gdb_path, "PARCELS_2024_Centroid_Clean")

# Set target spatial reference
target_sr = arcpy.SpatialReference("NAD_1983_StatePlane_Louisiana_South_FIPS_1702_Feet")

# Start timing the process
start_time = time.time()

try:
    # Check and print the spatial reference of the input dataset
    input_sr = arcpy.Describe(parcel_fc).spatialReference
    print(f"Input Spatial Reference: {input_sr.name}")

    # Ensure feature classes do not already exist
    for fc in [centroid_fc, cleaned_centroid_fc]:
        if arcpy.Exists(fc):
            arcpy.Delete_management(fc)
            print(f"Deleted existing feature class: {fc}")

    # Step 1: Create centroids from parcel polygons
    arcpy.FeatureToPoint_management(parcel_fc, centroid_fc, "INSIDE")
    print("Centroid feature class created.")

    # Step 2: Reproject if necessary
    if input_sr.factoryCode != target_sr.factoryCode:
        print("Input data is in a different projection. Reprojecting...")
        projected_centroid_fc = os.path.join(gdb_path, "PARCELS_2024_Centroid_Projected")
        if arcpy.Exists(projected_centroid_fc):
            arcpy.Delete_management(projected_centroid_fc)

        arcpy.Project_management(centroid_fc, projected_centroid_fc, target_sr)
        print("Projection to NAD_1983_StatePlane_Louisiana_South_FIPS_1702_Feet complete.")
        processing_fc = projected_centroid_fc
    else:
        print("Input data is already in the correct projection. Skipping projection step.")
        processing_fc = centroid_fc

    # Step 3: Remove duplicate records based on geometry
    arcpy.DeleteIdentical_management(processing_fc, ["SHAPE"])
    print("Duplicate records removed.")

    # Step 4: Remove null geometries
    arcpy.FeatureClassToFeatureClass_conversion(processing_fc, gdb_path, "PARCELS_2024_Centroid_Clean", "SHAPE IS NOT NULL")
    print("Records with null geometry removed.")

    # Step 5: Add fields for x and y coordinates
    arcpy.AddField_management(cleaned_centroid_fc, "POINT_X", "DOUBLE")
    arcpy.AddField_management(cleaned_centroid_fc, "POINT_Y", "DOUBLE")
    print("Coordinate fields POINT_X and POINT_Y added.")

    # Step 6: Populate coordinate fields
    with arcpy.da.UpdateCursor(cleaned_centroid_fc, ["SHAPE@XY", "POINT_X", "POINT_Y"]) as cursor:
        for row in cursor:
            if row[0] is not None:
                x, y = row[0]
                row[1] = x
                row[2] = y
                cursor.updateRow(row)
    print("Coordinate fields updated with centroid values.")

    # Step 7: Export the cleaned centroid data to a CSV file
    field_names = [f.name for f in arcpy.ListFields(cleaned_centroid_fc) if f.type != 'Geometry']
    csv_file = os.path.join(output_folder, "PARCELS_2024_Centroid.csv")
    with open(csv_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(field_names)  # Write header row with all field names
        with arcpy.da.SearchCursor(cleaned_centroid_fc, field_names) as cursor:
            for row in cursor:
                writer.writerow(row)
    print("CSV export completed:", csv_file)

    # Step 8: Export the cleaned centroid feature class as a shapefile
    shapefile_path = os.path.join(output_folder, "PARCELS_2024_Centroid.shp")
    arcpy.CopyFeatures_management(cleaned_centroid_fc, shapefile_path)
    print("Shapefile export completed:", shapefile_path)

except arcpy.ExecuteError:
    print("ArcPy error:", arcpy.GetMessages(2))
except Exception as e:
    print("General error:", str(e))

# End timing the process
end_time = time.time()
print("Process completed in {:.2f} seconds.".format(end_time - start_time))