Fraud is a major problem in the digital advertising industry, costing tens of billions of dollars in wasted ad spend every year. One sign of ad fraud is advertising technology (adtech) data that contains suspicious geographic patterns. Because of the complexity of advertising networks, ad fraud can be difficult to detect. This paper performed exploratory spatial data analysis on fraudulent and non-fraudulent adtech-like data in Beijing, China, to understand whether suspicious geographic patterns could be detected in support of removing fraudulent traffic from advertising networks. It explored three types of suspicious geographic patterns seen in adtech data: gridding (events appear aligned to an invisible grid), location spoofing (events occur in implausibly dense clusters), and centroid bias (events are concentrated around a fixed center and density decreases away from the center).
This paper explores whether fraudulent and non-fraudulent adtech-like datasets differ in spatial clustering, dispersion, or regularity, and whether there are differences in hotspot structures, density surfaces, nearest-neighbor behavior, or spatial autocorrelation among the four datasets. After introducing the topic and explaining the data derivation process, the paper applies methods from exploratory spatial data analysis, including basic dataset summaries, point pattern analysis, spatial autocorrelation, and hotspot analysis. The analysis found that all datasets exhibited spatial clustering and deviated from complete spatial randomness (CSR). The gridded dataset was nearly indistinguishable from the baseline dataset on most spatial measures and differed mainly in its share of unique coordinates. The location spoofing dataset showed major differences in density surfaces and hot spot structures. The centroid bias dataset had the quadrat chi-square statistic closest to spatial randomness and showed major differences in nearest neighbor analysis and local Moran’s I/local indicators of spatial autocorrelation (LISA). The paper concludes with a reflection on the project and suggestions for future research.
In the digital marketing or advertising technology (adtech) ecosystem, there is a process for delivering the right ad to the right person at the right time. This process is called real-time bidding (RTB), and it is designed to automate the buying, selling, and delivering of digital ads in a manner that is optimized for both buyers and sellers in the digital marketing ecosystem (Baker & Hostetler LLP, 2023). The speed and scale of this process are impressive, but it is not without challenges. Specifically, an estimated 20% of all advertising traffic is fraudulent, costing marketers more than $100 million per day, or $33 billion to $100 billion per year (HUMAN Security, n.d.). Ad fraud is the deliberate practice of manipulating metrics or data in order to steal money from advertising budgets, and it is a huge problem for both advertisers and advertising networks. However, due to the scale and complexity of advertising networks, ad fraud can be very difficult to detect.
This project explored suspicious geographic patterns in adtech data, a sign of ad fraud, for the purpose of filtering out fraudulent data from advertising networks. RTB data often includes a user’s geolocation in order to serve the user the “best” ad. The geolocation can be derived from a number of possible location-based services (e.g., GPS, WiFi positioning system, cellular triangulation, or IP address lookup). While suspicious geographic patterns may be obvious to the human eye, they may not be as obvious to a machine or advertising network. Three types of suspicious geographic patterns found in RTB data are gridding (where events appear to be snapped to an invisible grid), location spoofing (where many devices appear at the same location in a manner that is not plausible), and centroid bias (where events appear clustered around a center and become less dense further away from that center) (Mader, 2025).
This project sought to uncover whether suspicious geographic patterns could be detected programmatically and at scale using spatial data science techniques. The research questions that this study addressed were: 1) Do fraudulent and non-fraudulent RTB datasets differ in spatial clustering, dispersion, or regularity? 2) Are there differences in hotspot structures, density surfaces, nearest-neighbor behavior, or spatial autocorrelation?
A major limitation of this study was the lack of publicly available RTB data, which was likely due to privacy reasons and possibly financial reasons. The most widely available, if not the only, public RTB dataset was the 2013 iPinYou RTB dataset, which was 35GB and contained 20 million rows of data (Liao et al., 2014). Unfortunately, the geolocation data contained therein did not include coordinates; it named only the county group or metro region. This level of granularity was insufficient to use as a baseline against suspicious geographic patterns. Instead of manipulating this dataset, an RTB-like dataset that mimicked human behavior at scale, the GeoLife dataset by Microsoft, was selected. A summary of the datasets used for this project is provided in Table 1.
| Dataset Name | Dataset Description | Source |
|---|---|---|
| Microsoft’s GeoLife GPS trajectory dataset | This GPS trajectory dataset was collected in (Microsoft Research) Geolife project by 178 users in a period of over four years (from April 2007 to October 2011) and contains about 24 million points. | https://www.kaggle.com/datasets/arashnic/microsoft-geolife-gps-trajectory-dataset |
| Baseline dataset | A sample of the above dataset within the study area (Beijing) containing 100,000 points. | Derived from Microsoft’s GeoLife dataset |
| 20% Gridded dataset | This dataset was an adaptation of the baseline dataset, where 20% of the data was rounded to the nearest 0.01 degree. | Author created |
| 20% Location Spoofing dataset | This dataset was an adaptation of the baseline dataset, where 20% of the data was reassigned to 5 seed geolocations within the study area and jittered within a tiny bounding box (11m x 11m). | Author created |
| 20% Centroid Bias dataset | This dataset was an adaptation of the baseline dataset, where 20% of the data was reassigned to a location surrounding a fixed seed coordinate | Author created |
Microsoft’s GeoLife GPS trajectory dataset was 2GB and contained about 24 million points from 178 users over more than four years, with a heavy concentration in Beijing, China. Like RTB data, this dataset was high volume, contained individual points, was derived from mobile device location-based services, and showed genuine human movement patterns. To prepare this dataset for analysis, a bounding box around Beijing was used as the study area, spanning 1 degree of latitude and 1 degree of longitude (from 39.6° to 40.6° North and 116° to 117° East), or about 111km north to south by 85km east to west. Due to the large volume of data, it was processed via a Python script in a Jupyter Notebook in a Google Colab environment, which had the benefit of additional RAM. Looping through each of the 20,000 files took about two hours, and only points with latitudes and longitudes within the study area were selected. Instead of holding the data in memory, each selection was immediately appended to a single csv file. After de-duplicating the data, the csv file was 385MB and contained 17,647,543 rows. From this, a 100,000 point sample was selected to be the baseline dataset, which was about 2MB and less computationally intensive to work with.
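The approximate ground size of the study area follows from the length of one degree at Beijing's latitude. The short R sketch below reproduces that arithmetic under a spherical-Earth approximation; the mean Earth radius of 6,371 km and the use of the box's mid-latitude are assumptions made for illustration and are not taken from the original workflow.

```r
# Approximate ground size of the 1 x 1 degree study area (spherical Earth).
earth_radius_km <- 6371                        # assumed mean Earth radius
km_per_degree   <- 2 * pi * earth_radius_km / 360

mid_lat_deg <- (39.6 + 40.6) / 2               # mid-latitude of the bounding box

# One degree of latitude has a near-constant length.
height_km <- km_per_degree
# One degree of longitude shrinks with the cosine of the latitude.
width_km  <- km_per_degree * cos(mid_lat_deg * pi / 180)

round(c(height_km = height_km, width_km = width_km))
```

The rounded values agree with the 111km by 85km figures quoted above.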
In addition to the RTB-like dataset, the project required fraud-like datasets. From the GeoLife-derived baseline dataset, three fraud-laden datasets were derived, each containing a suspicious geographic pattern (See Figure 1). For each new dataset, 20% of the data was fraudulent and 80% of the data was from the baseline. This 80/20 ratio represented a more realistic scenario where fraud data is embedded within valid data. The three aforementioned suspicious geographic patterns were: gridding, location spoofing, and centroid bias.
To create the data for gridding, 20% of the baseline points were randomly selected and rounded to the nearest 0.01 degree. At the right scale, this process gave the appearance of data being snapped to an invisible grid.
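The gridding step can be sketched in a few lines of base R. This is an illustration only: `baseline_demo` is a hypothetical random stand-in for the real baseline sample, and `add_gridding()` is a helper name introduced here, not the script actually used to build the dataset.

```r
set.seed(1)

# Hypothetical stand-in for the baseline sample (columns lat, lon, label).
baseline_demo <- data.frame(
  lat   = runif(1000, 39.6, 40.6),
  lon   = runif(1000, 116, 117),
  label = "baseline"
)

# Round a random share of rows to a fixed number of decimal places.
add_gridding <- function(df, fraud_share = 0.2, digits = 2) {
  n_fraud <- round(nrow(df) * fraud_share)
  idx <- sample.int(nrow(df), n_fraud)        # rows that become "fraud"
  df$lat[idx]   <- round(df$lat[idx], digits) # snap to the 0.01 degree grid
  df$lon[idx]   <- round(df$lon[idx], digits)
  df$label[idx] <- "gridded_fraud"
  df
}

gridded_demo <- add_gridding(baseline_demo)
table(gridded_demo$label)
```

Because many snapped points land on the same grid node, this transformation also reduces the number of unique coordinates in the dataset.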
To create data for location spoofing, 20% of the baseline data were randomly selected and reassigned to one of five seed locations within the study area, and then jitter was added to each coordinate within a very small bounding box (11m x 11m). The five seed locations were chosen based on IP address lookups within the study area: Beijing (39.9075, 116.3971), Xicheng District (39.89968, 116.3345), Chaoyang District (39.9042, 116.4070), Haidian District (39.9593, 116.2981), and Fengtai District (39.8585, 116.2863), the latter four being major districts of Beijing. Exactly 50% of the fraudulent data was assigned to the Beijing seed location, simulating a dominant fraud source, and the remaining 50% was evenly distributed across the other four seeds. This spatial pattern contained very small, extremely dense clusters.
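A minimal base R sketch of the spoofing step is shown below. The seed coordinates and the 50/12.5/12.5/12.5/12.5 split come from the description above; the uniform jitter, the metre-to-degree conversion factor (111,320 m per degree of latitude), and the helper name `spoof_points()` are assumptions made for illustration.

```r
set.seed(2)

# Seed coordinates as listed in the text; the first is the dominant seed.
seeds <- data.frame(
  name = c("Beijing", "Xicheng", "Chaoyang", "Haidian", "Fengtai"),
  lat  = c(39.9075, 39.89968, 39.9042, 39.9593, 39.8585),
  lon  = c(116.3971, 116.3345, 116.4070, 116.2981, 116.2863)
)

spoof_points <- function(n, seeds, box_m = 11) {
  # Half of the points go to the first seed; the rest cycle over the others.
  n_main  <- n %/% 2
  seed_id <- c(rep(1L, n_main), rep_len(2:nrow(seeds), n - n_main))
  # Assumed spherical conversion from metres to degrees.
  m_per_deg_lat <- 111320
  m_per_deg_lon <- 111320 * cos(seeds$lat[seed_id] * pi / 180)
  data.frame(
    lat   = seeds$lat[seed_id] + runif(n, -box_m / 2, box_m / 2) / m_per_deg_lat,
    lon   = seeds$lon[seed_id] + runif(n, -box_m / 2, box_m / 2) / m_per_deg_lon,
    seed  = seeds$name[seed_id],
    label = "location_spoofing"
  )
}

spoofed_demo <- spoof_points(20000, seeds)
table(spoofed_demo$seed)
```

With 20,000 fraudulent points (20% of 100,000), this places 10,000 points at the dominant seed and 2,500 at each of the other four.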
To create data for centroid bias, 20% of the baseline data were randomly selected and reassigned to a location surrounding a fixed central coordinate (Beijing: 39.9075, 116.3971), placed in a random direction at a random distance of up to 40 km. The points were biased toward shorter distances by setting each point’s distance from the center equal to the maximum distance (40,000m) multiplied by the square of a random number between 0.0 and 1.0. Because squaring a number between 0 and 1 makes it smaller, distances from the center point were pulled toward zero, creating a centroid bias. This spatial pattern contained points clustered around a fixed center that became less dense further away from the center.
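The distance rule described above can be sketched in base R as follows. The squared-uniform distance and the 40,000 m maximum come from the text; the flat-Earth offset, the 111,320 m per degree conversion, and the helper name `centroid_points()` are assumptions for illustration.

```r
set.seed(3)

centroid_points <- function(n, center_lat = 39.9075, center_lon = 116.3971,
                            max_dist_m = 40000) {
  u       <- runif(n)               # uniform random number in [0, 1]
  dist_m  <- max_dist_m * u^2       # squaring skews distances toward 0
  bearing <- runif(n, 0, 2 * pi)    # random direction in radians
  # Assumed flat-Earth offset; adequate for a sketch at this scale.
  m_per_deg_lat <- 111320
  m_per_deg_lon <- 111320 * cos(center_lat * pi / 180)
  data.frame(
    lat    = center_lat + dist_m * cos(bearing) / m_per_deg_lat,
    lon    = center_lon + dist_m * sin(bearing) / m_per_deg_lon,
    dist_m = dist_m,
    label  = "centroid_bias"
  )
}

centroid_demo <- centroid_points(20000)
median(centroid_demo$dist_m)   # expected to be near 40000 * 0.5^2 = 10000
```

Since a squared uniform number is below 0.25 exactly when the number itself is below 0.5, about half of the reassigned points fall within 10 km of the center, even though the maximum distance is 40 km.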
Figure 1. Data Plots. The plot of the baseline dataset (A) containing 100% valid data (blue) is compared against three plots of datasets containing 80% valid data and 20% fraudulent data (red) with suspicious geographic patterns, including gridded fraud (B), location spoofing (C), and centroid bias (D).
It’s worth noting that most of the attributes of the data had been removed, and the only attributes of the data points that remained were latitude, longitude, and label (e.g., baseline, gridded_fraud, location_spoofing, or centroid_bias). There was no temporal attribute, and duplicate points were removed before taking the 100,000 sample points. These characteristics limited how the data could be used and precluded any meaningful analysis related to frequency or time. Overall, these datasets were optimized for analysis of the spatial structure of suspicious geographic patterns that appear in adtech data.
Figure 2 provides an overview of the data preparation and methodology.
Figure 2. Flow Chart describing data preparation and methodology.
First, the data was prepared for analysis. Necessary packages were loaded into the environment, and file paths were set. The data was then loaded into the environment using the readr package from tidyverse.
Second, the coordinate data was converted into spatial features with geometries using the sf package.
# 2. Convert to sf objects ----------------------------------------------
make_sf <- function(df) {
  st_as_sf(
    df,
    coords = c("lon", "lat"),
    crs = 4326,
    remove = FALSE
  )
}
datasets_sf <- map(datasets, make_sf)
baseline_sf <- datasets_sf$Baseline
gridded_sf <- datasets_sf$Gridded
spoofing_sf <- datasets_sf$Spoofing
centroid_sf <- datasets_sf$Centroid
Third, the data was transformed into a projected coordinate system using the st_transform() function from sf in preparation for distance-based analysis. UTM Zone 50N (EPSG:32650) was selected because it is a standard projected coordinate system, based on the WGS 84 datum, whose zone covers Beijing, China (Spatial Reference, n.d.).
# 3. Project data to UTM Zone 50N ---------------------------------------
datasets_proj <- map(datasets_sf, ~ st_transform(.x, 32650))
baseline_proj <- datasets_proj$Baseline
gridded_proj <- datasets_proj$Gridded
spoofing_proj <- datasets_proj$Spoofing
centroid_proj <- datasets_proj$Centroid
Fourth, a study window was created using the spatstat package to apply to each dataset to ensure that differences in clustering were due to spatial patterns and not differences in geographic extent.
# 4. Create study area window for spatstat -------------------------------
study_area_wgs84 <- st_as_sfc(
  st_bbox(
    c(xmin = 116, xmax = 117, ymin = 39.6, ymax = 40.6),
    crs = 4326
  )
)
study_area_proj <- st_transform(study_area_wgs84, 32650)
study_coords <- st_coordinates(study_area_proj)[, 1:2]
study_window <- owin(
  xrange = range(study_coords[, 1]),
  yrange = range(study_coords[, 2])
)
In the fifth and final step of the data preparation process, planar point pattern (ppp) objects were created for each dataset (baseline, gridded fraud, location spoofing, and centroid bias) using the spatstat package, allowing for the plotting, comparison, and analysis of the geographic patterns contained within the different datasets.
# 5. Create ppp objects -------------------------------------------------
make_ppp <- function(sf_obj) {
  xy <- st_coordinates(sf_obj)
  ppp(
    x = xy[, 1],
    y = xy[, 2],
    window = study_window
  )
}
datasets_ppp <- map(datasets_proj, make_ppp)
baseline_ppp <- datasets_ppp$Baseline
gridded_ppp <- datasets_ppp$Gridded
spoofing_ppp <- datasets_ppp$Spoofing
centroid_ppp <- datasets_ppp$Centroid
Exploratory spatial data analysis was performed on all four datasets: baseline, gridded fraud, location spoofing, and centroid bias. This included basic dataset summaries, point pattern analysis, spatial autocorrelation, and hot spot analysis.
In step 6, basic dataset summaries were produced comparing the number of unique coordinates among the datasets.
# 6. Basic dataset summaries ----------------------------------------------
summary_table <- map2_dfr(datasets, names(datasets), function(df, nm) {
  tibble(
    dataset = nm,
    n_points = nrow(df),
    unique_coords = n_distinct(paste(df$lat, df$lon)),
    percent_unique = round(unique_coords / n_points * 100, 2),
    min_lat = min(df$lat),
    max_lat = max(df$lat),
    min_lon = min(df$lon),
    max_lon = max(df$lon)
  )
})
summary_table
## # A tibble: 4 × 8
## dataset n_points unique_coords percent_unique min_lat max_lat min_lon max_lon
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baseline 100000 100000 100 39.6 40.6 116. 117.
## 2 Gridded 100000 81609 81.6 39.6 40.6 116 117
## 3 Spoofing 100000 100000 100 39.6 40.6 116. 117.
## 4 Centroid 100000 100000 100 39.6 40.6 116. 117.
write_csv(summary_table, "outputs/table_dataset_summary.csv")
The baseline, location spoofing, and centroid bias datasets all contained 100% unique coordinates. However, the gridded fraud dataset contained only 81.6% unique coordinates because rounding 20% of the data to the nearest 0.01 degree collapsed many points onto shared grid coordinates.
In step 7, the datasets were plotted and compared against each other with the fraudulent points in red and the original data points in black.
# 7. Four-panel point map with fraud highlighted ------------------------
plot_points <- function(df, title) {
  if ("label" %in% names(df) && any(df$label != "baseline")) {
    ggplot() +
      geom_point(
        data = df %>% filter(label == "baseline"),
        aes(x = lon, y = lat),
        size = 0.1,
        alpha = 0.15
      ) +
      geom_point(
        data = df %>% filter(label != "baseline"),
        aes(x = lon, y = lat),
        size = 0.2,
        alpha = 0.5,
        color = "red"
      ) +
      coord_equal(xlim = c(116, 117), ylim = c(39.6, 40.6)) +
      labs(title = title, x = "Longitude", y = "Latitude") +
      theme_minimal()
  } else {
    ggplot(df, aes(x = lon, y = lat)) +
      geom_point(size = 0.1, alpha = 0.2) +
      coord_equal(xlim = c(116, 117), ylim = c(39.6, 40.6)) +
      labs(title = title, x = "Longitude", y = "Latitude") +
      theme_minimal()
  }
}
p1 <- plot_points(baseline, "A) Baseline")
p2 <- plot_points(gridded, "B) Gridded Fraud")
p3 <- plot_points(spoofing, "C) Location Spoofing")
p4 <- plot_points(centroid, "D) Centroid Bias")
point_panel <- (p1 + p2) / (p3 + p4) +
plot_annotation(title = "Baseline and Synthetic Fraud Point Patterns")
point_panel
The simple plots clearly showed the suspicious geographic patterns.
In step 8, the raw hex counts were obtained for each dataset using the hotspot_count() function from the sfhotspot package.
# 8. Raw hex counts using sfhotspot::hotspot_count() ---------------------
hex_cell_size <- 3000 # meters
make_hex_counts <- function(sf_obj, dataset_name) {
  hotspot_count(
    data = sf_obj,
    grid_type = "hex",
    cell_size = hex_cell_size,
    quiet = TRUE
  ) %>%
    rename(count = n) %>%
    mutate(dataset = dataset_name)
}
hex_counts <- imap(datasets_proj, make_hex_counts)
baseline_hex <- hex_counts$Baseline
gridded_hex <- hex_counts$Gridded
spoofing_hex <- hex_counts$Spoofing
centroid_hex <- hex_counts$Centroid
# filter out empty hexes before plotting
baseline_hex_plot <- baseline_hex %>% filter(count > 0)
gridded_hex_plot <- gridded_hex %>% filter(count > 0)
spoofing_hex_plot <- spoofing_hex %>% filter(count > 0)
centroid_hex_plot <- centroid_hex %>% filter(count > 0)
plot_hex_counts <- function(hex_sf, title) {
  ggplot(hex_sf) +
    geom_sf(aes(fill = count), color = NA) +
    scale_fill_viridis_c(option = "magma") +
    labs(
      title = title,
      fill = "Point count"
    ) +
    theme_minimal()
}
hc1 <- plot_hex_counts(baseline_hex_plot, "A) Baseline")
hc2 <- plot_hex_counts(gridded_hex_plot, "B) Gridded Fraud")
hc3 <- plot_hex_counts(spoofing_hex_plot, "C) Location Spoofing")
hc4 <- plot_hex_counts(centroid_hex_plot, "D) Centroid Bias")
hex_count_panel <- (hc1 + hc2) / (hc3 + hc4) +
plot_annotation(title = "Raw Point Counts in 3 km Hexagonal Cells")
hex_count_panel
The raw point counts in 3 km hexagonal cells showed that two of the datasets (baseline and gridded fraud) appeared similar in structure. The location spoofing dataset, where 20% of the data fell into one of five 11m x 11m clusters, showed a pattern that was very different from the other three datasets, with one very high count cell that overshadowed the rest of the data. The raw point counts for the centroid bias dataset were similar to those of the baseline dataset, with the exception of a high count cell at the chosen center.
In step 9, the final step of the dataset summaries portion, quadrat counts and a chi-square test were performed for all four datasets. The quadrat count divided the study area into equal-sized cells, or quadrats, and counted the number of points inside each cell. The chi-square test compared the observed quadrat counts with the quadrat counts expected under complete spatial randomness (CSR).
# 9. Quadrat counts and chi-square test for CSR --------------------------
run_quadrat_test <- function(ppp_obj, dataset_name, nx = 10, ny = 10) {
  qt <- quadrat.test(
    ppp_obj,
    nx = nx,
    ny = ny,
    method = "Chisq"
  )
  tibble(
    dataset = dataset_name,
    quadrats_x = nx,
    quadrats_y = ny,
    chi_square = unname(qt$statistic),
    df = unname(qt$parameter),
    p_value = qt$p.value
  )
}
quadrat_table <- bind_rows(
run_quadrat_test(baseline_ppp, "Baseline"),
run_quadrat_test(gridded_ppp, "Gridded Fraud"),
run_quadrat_test(spoofing_ppp, "Location Spoofing"),
run_quadrat_test(centroid_ppp, "Centroid Bias")
)
quadrat_table
## # A tibble: 4 × 6
## dataset quadrats_x quadrats_y chi_square df p_value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baseline 10 10 2179596. 99 0
## 2 Gridded Fraud 10 10 2172508. 99 0
## 3 Location Spoofing 10 10 2677022. 99 0
## 4 Centroid Bias 10 10 1794014. 99 0
The chi-square test returned a p-value reported as 0 for all datasets, meaning the p-values were too small to be represented. Because these values are less than 0.05, complete spatial randomness was rejected, indicating that the patterns in all datasets were unlikely to be random and that clustering or other structure was present.
The chi-square statistic provided more insight into these datasets. The higher the value, the more the pattern deviated from spatial randomness. The baseline had a high chi-square statistic indicating significant clustering, which was to be expected from human movement patterns. The chi-square statistic from the gridded fraud dataset was about the same as the baseline (less than 1% lower). The chi-square statistic of the location spoofing dataset was about 23% higher than the baseline, the largest deviation from spatial randomness, likely due to the very tight clusters in the data. Lastly, the centroid bias dataset had the lowest chi-square value, placing it closest to spatial randomness of all the datasets.
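To make the statistic concrete, the following base R sketch computes a Pearson chi-square for quadrat counts by hand, assuming equal-area quadrats so that every cell has the same expected count under CSR. The function name `quadrat_chisq()` and the two small count vectors are hypothetical examples, not values from the analysis.

```r
# Pearson chi-square for quadrat counts against CSR (equal-area quadrats).
quadrat_chisq <- function(counts) {
  expected <- sum(counts) / length(counts)   # same expectation in every cell
  stat <- sum((counts - expected)^2 / expected)
  c(chi_square = stat, df = length(counts) - 1)
}

# Hypothetical 2 x 2 examples with 400 points each.
even      <- c(100, 100, 100, 100)   # perfectly even counts
clustered <- c(370, 10, 10, 10)      # most points in one quadrat

quadrat_chisq(even)             # chi_square = 0
quadrat_chisq(clustered)        # chi_square = 972
quadrat_chisq(clustered * 10)   # chi_square = 9720
```

For a fixed relative pattern the statistic grows in proportion to the number of points, as the last call shows. This is one reason why datasets of 100,000 points produce values in the millions, and why the statistic is more informative for ranking the datasets against each other than as a yes/no test.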
Point pattern analysis is a statistical approach to understanding how points or events are spatially distributed in a given study area (Lloyd 2011). Typically, this approach involves using methods that investigate point intensity (the number of events expected per unit area at a specific location) and point separation (the distance between points) (Lloyd 2011). The results of point pattern analysis will provide insight into whether points are clustered, dispersed, or randomly distributed.
In step 10, kernel density estimation (KDE) was used to create an intensity surface for each of the datasets. Bandwidths of 250m, 500m, 1,000m, and 2,000m were explored. The chosen bandwidth was 1,000m, which appeared to be the best bandwidth for identifying patterns in the data without oversmoothing or undersmoothing.
# 10. Kernel density estimation with spatstat ----------------------------
# Chosen KDE bandwidth
chosen_bw <- 1000
kde_list <- map(
  datasets_ppp,
  ~ density(.x, sigma = chosen_bw, edge = TRUE)
)
par(mfrow = c(2, 2))
plot(kde_list$Baseline, main = "A) Baseline KDE, sigma = 1000 m")
plot(kde_list$Gridded, main = "B) Gridded KDE, sigma = 1000 m")
plot(kde_list$Spoofing, main = "C) Location Spoofing KDE, sigma = 1000 m")
plot(kde_list$Centroid, main = "D) Centroid Bias KDE, sigma = 1000 m")
# reset the plotting layout so later plots are not drawn in the 2 x 2 grid
par(mfrow = c(1, 1))
The intensity surfaces produced for each dataset were more similar than different, but there are a few important considerations. All KDE results had the same bright hotspot, which came from the valid data that carried through to each of the derived datasets. The KDE results for the baseline and gridded fraud datasets looked identical. The KDE results from the location spoofing dataset showed the appearance of clusters, and the KDE result for the centroid bias dataset showed an additional cluster at the center of the centroid.
In step 11, nearest neighbor analysis was conducted. Nearest neighbor analysis is used to determine whether points are clustered, dispersed, or random. It calculates the straight-line distance from each feature to its nearest neighbor and averages the distances.
# 11. Distance-based analysis: nearest neighbor --------------------------
nn_table <- map2_dfr(datasets_ppp, names(datasets_ppp), function(pp, nm) {
  nn <- nndist(pp)
  tibble(
    dataset = nm,
    mean_nn_m = mean(nn),
    median_nn_m = median(nn),
    min_nn_m = min(nn),
    q1_nn_m = quantile(nn, 0.25),
    q3_nn_m = quantile(nn, 0.75),
    max_nn_m = max(nn)
  )
})
nn_table
## # A tibble: 4 × 7
## dataset mean_nn_m median_nn_m min_nn_m q1_nn_m q3_nn_m max_nn_m
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baseline 29.1 9.69 0.000181 4.37 22.2 7232.
## 2 Gridded 28.3 7.66 0 1.86 20.5 7530.
## 3 Spoofing 27.0 7.48 0.000471 1.73 19.9 7232.
## 4 Centroid 55.2 14.2 0.000130 5.76 42.6 7232.
nn_plot_data <- map2_dfr(datasets_ppp, names(datasets_ppp), function(pp, nm) {
  tibble(
    dataset = nm,
    nn_m = nndist(pp)
  )
})
nn_plot <- ggplot(nn_plot_data, aes(x = dataset, y = nn_m)) +
  geom_boxplot(outlier.alpha = 0.1) +
  coord_cartesian(ylim = c(0, quantile(nn_plot_data$nn_m, 0.95))) +
  labs(
    title = "Nearest-Neighbor Distance by Dataset",
    x = NULL,
    y = "Nearest-neighbor distance (meters)"
  ) +
  theme_minimal()
nn_plot
The nearest neighbor analysis showed that three of the datasets had similar results. The baseline, gridded fraud, and location spoofing datasets had similar means (27.0m to 29.1m) and box sizes. However, the centroid bias dataset had the tallest box and the highest mean (55.2m), as well as the fewest outliers.
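One way to put the mean nearest-neighbor distances on a common scale is a Clark-Evans style ratio: the observed mean divided by the mean expected under CSR, which is 1 / (2 * sqrt(lambda)) for intensity lambda. This ratio was not part of the original analysis. The sketch below uses the means reported in the table above and approximates the study window as 85 km by 111 km with no edge correction, so the results should be read as rough values only.

```r
# Clark-Evans style ratio from the reported mean nearest-neighbour distances.
n_points <- 100000
area_m2  <- 85000 * 111000                 # assumed approximate window area
lambda   <- n_points / area_m2             # points per square metre
expected_nn_m <- 1 / (2 * sqrt(lambda))    # mean NN distance under CSR

observed_nn_m <- c(Baseline = 29.1, Gridded = 28.3,
                   Spoofing = 27.0, Centroid = 55.2)

ce_ratio <- observed_nn_m / expected_nn_m
round(ce_ratio, 2)
```

Ratios below 1 indicate clustering. Under these assumptions the expected distance is roughly 154m, so all four datasets fall well below 1, with the centroid bias dataset closest to 1. The spatstat package also provides a `clarkevans()` function that applies edge corrections.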
In step 12, Ripley’s K and L functions were estimated. Like nearest neighbor analysis, Ripley’s K and L functions provide insight into whether points are clustered, dispersed, or randomly distributed, but they do so across a range of distances rather than only at the nearest neighbor. Due to the large size of the datasets, a sample of 20,000 points was used.
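The code that computed the K and L estimates is not shown, so the following is only a sketch of how such a step could look with spatstat. The demo point patterns, the helper `sample_ppp()`, the reduced sample size, and the choice of border edge correction are all assumptions made for illustration.

```r
library(spatstat)

set.seed(4)

# Hypothetical stand-ins for the ppp objects built in step 5.
unit_window <- owin(xrange = c(0, 1), yrange = c(0, 1))
demo_ppp_list <- list(
  Random    = ppp(runif(500), runif(500), window = unit_window),
  Clustered = ppp(rnorm(500, 0.5, 0.05), rnorm(500, 0.5, 0.05),
                  window = unit_window)
)

# Draw a fixed-size random subsample so K and L stay cheap to compute.
sample_ppp <- function(pp, n_max) {
  if (npoints(pp) <= n_max) return(pp)
  pp[sample.int(npoints(pp), n_max)]
}

demo_samples <- lapply(demo_ppp_list, sample_ppp, n_max = 200)

# Border edge correction is an assumption, chosen here for speed.
k_demo <- lapply(demo_samples, Kest, correction = "border")
l_demo <- lapply(demo_samples, Lest, correction = "border")
```

Under CSR, K(r) equals pi * r^2 and L(r) = sqrt(K(r) / pi) equals r, so an observed curve above the theoretical curve indicates clustering at that distance.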
Plot the results of Ripley’s K
# plot Ripley's K
par(mfrow = c(2, 2))
walk2(
k_results,
names(k_results),
~ plot(.x, main = paste(.y, "Ripley's K, 20k sample"))
)
par(mfrow = c(1, 1))
Plot the results of Ripley’s L
# plot Ripley's L
par(mfrow = c(2, 2))
walk2(
l_results,
names(l_results),
~ plot(.x, main = paste(.y, "Ripley's L, 20k sample"))
)
par(mfrow = c(1, 1))
Spatial autocorrelation identifies how correlated a variable is with itself over space.
In step 13, spatial autocorrelation was explored using global Moran’s I and local Moran’s I (LISA) on the hexagon counts.
Results of the Moran’s I table
# results of Moran's I
moran_table
## # A tibble: 4 × 5
## dataset moran_i expected_i variance p_value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Baseline 0.554 -0.000763 0.000202 0
## 2 Gridded Fraud 0.567 -0.000762 0.000205 0
## 3 Location Spoofing 0.367 -0.000772 0.000208 1.60e-143
## 4 Centroid Bias 0.530 -0.000772 0.000202 5.22e-306
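The code that produced this table is not shown. To illustrate what the statistic measures, the base R sketch below computes global Moran's I by hand on a hypothetical 4 x 4 lattice with binary rook (shared-edge) neighbors. It is not the weighting scheme or package used in the analysis, and the function name `moran_i()` is introduced here for illustration.

```r
# Global Moran's I with a binary weights matrix w (w[i, j] = 1 for neighbours).
moran_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Hypothetical 4 x 4 lattice; cells sharing an edge are neighbours (rook).
cells <- expand.grid(col = 1:4, row = 1:4)
w <- (as.matrix(dist(cells, method = "manhattan")) == 1) * 1

halves  <- as.numeric(cells$col <= 2)     # high values on the left half
checker <- (cells$col + cells$row) %% 2   # alternating high and low values

moran_i(halves, w)       # about 0.667: similar values sit next to each other
moran_i(checker, w)      # -1: every neighbour has the opposite value
-1 / (nrow(cells) - 1)   # expected value when there is no autocorrelation
```

The expected value of Moran's I under no spatial autocorrelation is -1 / (n - 1), so the expected_i values of about -0.00076 in the table correspond to roughly 1,300 cells.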
Results of LISA
# results of LISA
plot_lisa <- function(hex_sf, title) {
  ggplot(hex_sf) +
    geom_sf(aes(fill = lisa_cluster), color = NA) +
    scale_fill_manual(
      values = c(
        "High-High" = "red",
        "Low-Low" = "blue",
        "High-Low" = "orange",
        "Low-High" = "lightblue",
        "Not significant" = "grey85"
      )
    ) +
    labs(
      title = title,
      fill = "LISA cluster"
    ) +
    theme_minimal()
}
l1 <- plot_lisa(baseline_lisa, "A) Baseline")
l2 <- plot_lisa(gridded_lisa, "B) Gridded Fraud")
l3 <- plot_lisa(spoofing_lisa, "C) Location Spoofing")
l4 <- plot_lisa(centroid_lisa, "D) Centroid Bias")
(l1 + l2) / (l3 + l4) +
plot_annotation(title = "Local Moran's I / LISA Clusters")
The Getis-Ord Gi* statistic is used to identify statistically significant hot and cold spots.
In step 14, the Gi* statistic was computed for the four datasets. Note that 1.96 is the critical z-score used to define statistically significant hot spots and cold spots at the 95% confidence level.
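The 1.96 threshold and the resulting three-way labeling can be sketched in base R as follows. The helper name `classify_gi()` and the example z-scores are hypothetical; the labels match those used in the plot legend.

```r
# Two-sided critical z-score for a 95% confidence level.
crit_95 <- qnorm(1 - 0.05 / 2)
round(crit_95, 2)   # 1.96

# Label a vector of Gi* z-scores as hot spot, cold spot, or not significant.
classify_gi <- function(z, crit = crit_95) {
  ifelse(z >= crit, "Hot spot",
         ifelse(z <= -crit, "Cold spot", "Not significant"))
}

classify_gi(c(3.2, 0.4, -2.5))
# "Hot spot" "Not significant" "Cold spot"
```

Because every cell is tested against the same threshold, a small share of cells can exceed it by chance alone, so isolated hot or cold cells deserve more caution than contiguous groups.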
The Gi* results were then plotted.
plot_gi <- function(hex_sf, title) {
  ggplot(hex_sf) +
    geom_sf(aes(fill = gi_cluster), color = NA) +
    scale_fill_manual(
      values = c(
        "Hot spot" = "red",
        "Cold spot" = "blue",
        "Not significant" = "grey85"
      )
    ) +
    labs(
      title = title,
      fill = "Gi*"
    ) +
    theme_minimal()
}
g1 <- plot_gi(baseline_gi, "A) Baseline Gi*")
g2 <- plot_gi(gridded_gi, "B) Gridded Fraud Gi*")
g3 <- plot_gi(spoofing_gi, "C) Location Spoofing Gi*")
g4 <- plot_gi(centroid_gi, "D) Centroid Bias Gi*")
(g1 + g2) / (g3 + g4) +
plot_annotation(title = "Getis-Ord Gi* Hotspots and Coldspots")
Overall, exploratory spatial data analysis was performed using dataset summaries, point pattern analysis, spatial autocorrelation, and hot spot analysis.
This section discusses the results of ESDA for each dataset and interprets the results in the context of the research questions.
Research Question 1: Do fraudulent and non-fraudulent RTB datasets differ in spatial clustering, dispersion, or regularity?
Research Question 2: Are there differences in hotspot structures, density surfaces, nearest-neighbor behavior, or spatial autocorrelation?
The baseline dataset contained human mobility data, and as such, clustering was to be expected. Quadrat analysis led to a rejection of CSR, and nearest neighbor distances showed that 75% of the data had nearest neighbor distances less than 25 meters. The results from the raw hex counts and KDE showed a couple of hot spots and evidence of clustering that was not apparent from the point data alone. In both Ripley’s K and L functions, the observed values (solid black) were above the expected values (red dashed), again confirming spatial clustering. In the spatial autocorrelation portion, global Moran’s I and local Moran’s I both indicated significant clusters in the middle of the study area, with a global Moran’s I value of 0.554. This value was closer to 1 than -1, indicating positive spatial autocorrelation and clustering of similar values. Interestingly, the only significant high-low cluster in all four datasets appeared in the baseline dataset, where a high-value area was surrounded by low-value neighbors. The Getis-Ord Gi* statistic also showed hot spots in the middle of the study area.
The three datasets derived from the baseline data varied from the baseline in different ways.
The gridded fraud dataset had near-identical results to the baseline data in nearly all tests, including raw hex counts, quadrat counts with chi-square statistics, KDE results, nearest neighbor results, Ripley’s K and L functions, LISA clusters, and Getis-Ord Gi* plots. The global Moran’s I value (0.567) was within about 2% of the baseline’s (0.554), indicating essentially the same degree of spatial autocorrelation in the hexagon counts. This makes sense considering that rounding to the nearest 0.01 degree moves a point by less than 1 km, which is small relative to the 3 km hexagonal cells used for the counts. The most notable difference between the two datasets was the number of unique coordinates. While the baseline had 100% unique coordinates, the gridded fraud dataset had only 81.6% unique coordinates.
Overall, it would be very difficult to distinguish between these two datasets based on spatial methods alone. They appeared to differ in spatial clustering only slightly, and they had near-identical hot spot structures, density surfaces, and nearest-neighbor behavior. To detect gridding in adtech-like data, more research is required to investigate the coordinates or coordinate properties (e.g., number of decimal places or similarity in decimal structure).
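As a starting point for that kind of coordinate-level check, the base R sketch below measures the share of points whose latitude and longitude both sit exactly on a coarse grid. The function name `share_on_grid()`, the tolerance, and the demo data are hypothetical, and the check only catches grids aligned to round decimal values.

```r
# Share of points whose latitude AND longitude sit exactly on a coarse grid.
share_on_grid <- function(lat, lon, digits = 2, tol = 1e-9) {
  on_grid <- abs(lat - round(lat, digits)) < tol &
    abs(lon - round(lon, digits)) < tol
  mean(on_grid)
}

# Hypothetical example: 800 full-precision points plus 200 snapped points.
set.seed(5)
lat <- runif(1000, 39.6, 40.6)
lon <- runif(1000, 116, 117)
lat[1:200] <- round(lat[1:200], 2)
lon[1:200] <- round(lon[1:200], 2)

share_on_grid(lat, lon)   # 0.2, the snapped share
```

A share far above what full-precision location data would produce by chance is a simple, non-spatial signal that could complement the unique-coordinate percentage reported in the dataset summary.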
The location spoofing dataset was similar to the baseline in some ways but different in others. Most notably, the raw hex count and KDE intensity surface were the most visually distinct from those of the baseline. The clusters in the location spoofing dataset were readily apparent in both methods. In addition, the chi-square statistic from the quadrat counts was the highest of the datasets, about 23% higher than the baseline’s chi-square statistic. A high chi-square statistic indicates significant deviation from CSR, which makes sense given that at least 20% of the data was highly clustered. The Getis-Ord Gi* plot also showed deviation from the baseline, with a smaller central hot spot and multiple smaller hot spots, likely representing the locations of the five seeded clusters.
The location spoofing dataset had the lowest global Moran’s I value of all the datasets (0.367 compared to the baseline’s 0.554). A lower Moran’s I value indicates weaker broad-scale spatial autocorrelation in the hexagon-count surface, even though small-scale clustering is present in the data. Each spoofed cluster was so small that it likely fell within a single hex cell, producing isolated high-count cells with dissimilar neighbors and pulling the statistic toward 0 (no spatial autocorrelation) rather than 1 (positive spatial autocorrelation). Interestingly, the local Moran’s I/LISA analysis looked more similar to the baseline’s than different. Lastly, while the nearest neighbor analysis looked similar to that of the baseline, Ripley’s K and L functions showed an interesting difference. At small distances, there was a very steep rise in observed values, indicating a larger departure from expected values, or extreme spatial clustering, at those distances. The rise was more pronounced in the L function than the K function.
Overall, it appears possible to distinguish between these two datasets using spatial methods, specifically using density-based techniques, Ripley’s L function, and global Moran’s I. Compared to the baseline, this dataset scored higher when measuring deviation from CSR (on a chi-squared test after quadrat counts) but also displayed broad patterns that were closer to CSR at the global level - likely due to the size of the polygons compared with the scale of the study area. There are pronounced differences in density structures, hot spot structures, and spatial autocorrelation.
The centroid bias dataset also resembled the baseline on some measures and diverged on others. The results of the raw hex counts, KDE, Gi*, and Ripley’s K and L functions were very similar to those of the baseline dataset, with the addition of a hot spot at the center of the centroid bias area. The global Moran’s I statistic (0.63) was very close to that of the baseline dataset (0.62), both indicating positive spatial autocorrelation (like values clustered together). This dataset had the lowest chi-square statistic (about 82% of the baseline’s). Though this still indicates clustering and deviation from CSR, the lower value suggests less clustering than in the baseline. Interestingly, the nearest neighbor analysis deviated from the baseline and all other datasets. This dataset had the widest range of nearest neighbor distances and about half as many outliers. About 75% of the points were 42 meters or less from their nearest neighbor, compared to 22 meters for the baseline. Lastly, the most interesting result came from the local Moran’s I/LISA analysis: this dataset had the most cold spots, and all of them were outside of the centroid bias area.
Overall, while the dataset exhibited spatial clustering, it measured as less clustered than the baseline and, at the same time, exhibited higher indicators of spatial randomness. Only slight differences in density surfaces and hot spot structures were noted. This was the only dataset with major differences in nearest neighbor analysis and local (but not global) spatial autocorrelation.
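The nearest neighbor comparison (the distance within which 75% of points find their nearest neighbor) can be sketched as follows. This is an illustrative Python example with brute-force distances, not the R code used in this project. The toy generative process (tight clusters for the baseline, a wide Gaussian spread around one centroid for the biased portion), the 10 km window, and all parameter values are assumptions and may differ from how the project's datasets were derived.

```python
import math
import random

def nearest_neighbor_distances(points):
    """Brute-force distance from each point to its nearest other point."""
    distances = []
    for i, (xi, yi) in enumerate(points):
        best = math.inf
        for j, (xj, yj) in enumerate(points):
            if i != j:
                best = min(best, math.hypot(xi - xj, yi - yj))
        distances.append(best)
    return distances

def quantile(values, q):
    """Simple nearest-rank quantile of a list of numbers."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

rng = random.Random(11)
SIDE = 10_000.0  # hypothetical 10 km square study area, in meters

# Hypothetical clustered baseline: 30 tight clusters of 20 points each
centers = [(rng.uniform(0, SIDE), rng.uniform(0, SIDE)) for _ in range(30)]
baseline = [
    (cx + rng.gauss(0, 20), cy + rng.gauss(0, 20))
    for cx, cy in centers
    for _ in range(20)
]

# Swap one third of the points for draws spread widely around one centroid
around_centroid = [
    (rng.gauss(5000, 1500), rng.gauss(5000, 1500)) for _ in range(200)
]
biased = baseline[:400] + around_centroid

q75_base = quantile(nearest_neighbor_distances(baseline), 0.75)
q75_biased = quantile(nearest_neighbor_distances(biased), 0.75)
print(f"75th percentile NN distance, baseline: {q75_base:.1f} m")
print(f"75th percentile NN distance, centroid-biased: {q75_biased:.1f} m")
```

Under these assumptions, the points spread around the centroid sit farther from their neighbors than the tightly clustered baseline points do, which raises the upper quantiles of the nearest neighbor distance distribution.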
I was pleasantly surprised at how quickly the code ran. Working with multiple datasets containing 100,000 points each, I expected the processing to be much slower, but that was not the case. It was also fairly easy to work with packages that I had never used before (specifically spatstat and sfdep).
While the R script ran smoothly, getting the code to work in an R Markdown file was not without challenges. The biggest challenge was memory usage, which sometimes caused plots to fail to render or the document to fail to knit. I tried multiple ways to fix the issue, including: caching code blocks, changing parameters like hex bin resolution and figure dpi, filtering out data with 0 counts, sampling data from 100,000 points down to 20,000 points, using eval=FALSE to skip code, removing options to save plots, and breaking code blocks into computation blocks and plotting blocks (and using include=FALSE to run computational code blocks silently). I had to think carefully about what code I ran and how I handled the data.
With more time, I would have made an interactive map that allowed users to explore the geographic patterns. Because of memory and time constraints, I did not pursue this.
In the future, this analysis could be made stronger with additional tests and follow-on actions. For example, it would be interesting to compare the Moran scatter plots of the datasets or to apply DBSCAN to try to remove the high density clusters from the location spoofing data.
This project performed exploratory spatial data analysis (ESDA) on multiple datasets containing adtech-like data in Beijing. One dataset contained baseline adtech-like data, and three datasets were derived from it but also contained data with suspicious geographic patterns (gridding, very small high density clusters, and centroid bias). Multiple ESDA methods were used to explore the data, including basic data summaries, point pattern analysis, spatial autocorrelation, and hot spot analysis. Results indicated that all datasets exhibited clustering and deviated from CSR. The gridding dataset was more dispersed than the baseline, but its density surface, hot spot structure, and nearest neighbor analysis were about the same. The location spoofing dataset (high density clusters) was more clustered at smaller distances and had significant differences in its density surface and hot spot structure. The centroid bias dataset had higher indicators of spatial randomness. Future research could explore non-spatial ways to detect suspicious geographic patterns (for example, looking at characteristics of coordinates) or implement techniques that would remove the fraudulent data from the relevant datasets.