This project investigates spatial patterns and dependencies of theft incidents in Taoyuan City, Taiwan. Using geo-localized point data, the study evaluates whether theft occurrences exhibit clustering tendencies, identifies significant hotspots, and assesses the extent of spatial autocorrelation. Project include spatial visualizations, statistical diagnostics, kernel density estimation, spatial econometrics and insights to urban crime pattern.
Crime demonstrates non-random spatial patterns, influenced by land use, population density and policing allocation. Understanding these mechanisms is essential for promote effective stategies.
This project leverages real Taiwanese police incident records to examine two questions:
Do theft incidents exhibit significant spaital clustering, forming hotspot?
Is there evidence of spatial dependence- fo theft in one area correlate with neighboring areas?
To address above questions, this project applies following methods,to gain insights and provide data-driven policy.
** Spatial statistics
** Spatial weights matrix construction (KNN)
** Spatial dependence models (Moran’s I, LISA)
** Visualization of spatial phenomena
** Spatial econometrics model(SAR, SEM)
Data Type: Point-based geo-localized crime incident records Source: Taiwan Open Government Data Link: https://data.gov.tw/dataset/167673 Content: Contains the geographic coordinates and attributes of reported crime incidents. Spatial Processing: Filtering applied to remove mislocated or invalid coordinates. Spatial Boundary: Restricted to Taoyuan City, Taiwan.
Data Type: Demographic and spatial statistical data Source: Taiwan Open Government Data Link: https://data.gov.tw/dataset/18506 Content: Provides resident population counts and population density by administrative district. Spatial Boundary: Taoyuan City, Taiwan.
theft<-read.csv("/Users/ninalin/Desktop/dsba/winter semester 25/spatial econometrics/term project/Theft Incident Locations.csv")
# View(earthquake)
is.null(theft) # No missing value in our data
## [1] FALSE
# Filter out wrong data focus on Taoyuan City
theft <- theft %>%
filter(
Latitude >= 24.7,
Latitude <= 25.2,
Longitude >= 121.1,
Longitude <= 121.5)
str(theft)
## 'data.frame': 18043 obs. of 7 variables:
## $ Case.Type : chr "Bicycle Theft" "Bicycle Theft" "Bicycle Theft" "Bicycle Theft" ...
## $ Incident.Year: int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
## $ Date : int 20230120 20230113 20230113 20230113 20230113 20230109 20230109 20230109 20230109 20230130 ...
## $ Agency.Code : chr "380132100C" "380132100C" "380132100C" "380132100C" ...
## $ Precinct : chr "Zhongli Police Precinct" "Zhongli Police Precinct" "Zhongli Police Precinct" "Zhongli Police Precinct" ...
## $ Latitude : num 25 25 25 25 25 ...
## $ Longitude : num 121 121 121 121 121 ...
summary(theft)
## Case.Type Incident.Year Date Agency.Code
## Length:18043 Min. :2015 Min. :20150101 Length:18043
## Class :character 1st Qu.:2016 1st Qu.:20160320 Class :character
## Mode :character Median :2017 Median :20170906 Mode :character
## Mean :2018 Mean :20177876
## 3rd Qu.:2019 3rd Qu.:20190920
## Max. :2023 Max. :20231230
## Precinct Latitude Longitude
## Length:18043 Min. :24.78 Min. :121.1
## Class :character 1st Qu.:24.94 1st Qu.:121.2
## Mode :character Median :24.96 Median :121.3
## Mean :24.97 Mean :121.3
## 3rd Qu.:24.99 3rd Qu.:121.3
## Max. :25.12 Max. :121.5
nrow(theft)
## [1] 18043
# Taoyuan city map
twn1 <- st_read("https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_TWN.gpkg",layer = "ADM_ADM_1")
## Reading layer `ADM_ADM_1' from data source
## `https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_TWN.gpkg'
## using driver `GPKG'
## Simple feature collection with 7 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 116.71 ymin: 20.6975 xmax: 122.1085 ymax: 26.38542
## Geodetic CRS: WGS 84
twn2 <- st_read("https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_TWN.gpkg",layer = "ADM_ADM_2")
## Reading layer `ADM_ADM_2' from data source
## `https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_TWN.gpkg'
## using driver `GPKG'
## Simple feature collection with 22 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 116.71 ymin: 20.6975 xmax: 122.1085 ymax: 26.38542
## Geodetic CRS: WGS 84
taoyuan_1 <- twn1 %>% filter(NAME_1 == "Taoyuan")
taoyuan_2 <- twn2 %>% filter(NAME_2 == "Taoyuan")
ggplot() +
geom_sf(data = twn2, color = "gray80", fill = "gray95") +
geom_sf(data = taoyuan_2, fill = "white", color = "black", size = 1) +
coord_sf(
xlim = c(120.95, 121.50),
ylim = c(24.55, 25.15),
expand = FALSE
) +
theme_minimal() +
labs(title = "Taoyuan (Figure 1)")
# Raw point data
theft_sf <- st_as_sf(theft, coords = c("Longitude", "Latitude"), crs = 4326)
ggplot() +
geom_sf(data = taoyuan_1, fill = NA, color = "grey20", size = 0.3) +
geom_sf(data = theft_sf, color = "grey10", size = 0.02, alpha = 0.25) +
theme_void() +
coord_sf(expand = FALSE) +
theme(
plot.title = element_text(size = 16, hjust = 0.5)
) +
labs(title = "Spatial Distribution of Theft Incidents(Figure 2)")
# Crime data visualization
leaflet(data = theft) %>%
addProviderTiles("OpenStreetMap (Figure 3)") %>%
addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions())
The two visualizations reveal a strong pattern in theft incidents across Taoyuan City, particularly in Taoyuan, Zhongli, and Pingzhen districts. Rural areas display sparse activity. The dense point clusters indicates hotspot phenomenon. However, further spatial statistical testing is needed to confirm the significance of this spatial clustering.
## KDE (Kernel Density Estimation)
theft_sf <- st_as_sf(theft, coords = c("Longitude", "Latitude"), crs = 4326)
theft_sf_3826 <- st_transform(theft_sf, 3826)
coords <- st_coordinates(theft_sf_3826)
win <- as.owin(st_bbox(theft_sf_3826))
theft_ppp <- ppp(
x = coords[, 1],
y = coords[, 2],
window = win)
## Warning: data contain duplicated points
kde_smooth <- density.ppp(theft_ppp, sigma = 1500)
kde_raster <- raster(kde_smooth)
kde_df <- as.data.frame(kde_raster, xy = TRUE)
colnames(kde_df) <- c("x", "y", "value")
ggplot() +
geom_raster(data = kde_df, aes(x = x, y = y, fill = value), alpha = 0.85) +
scale_fill_viridis(option = "magma", name = "Density") +
geom_point(
data = as.data.frame(coords),
aes(X, Y),
color = "white", size = 0.25, alpha = 0.4
) +
coord_equal() +
theme_minimal() +
labs(
title = "KDE Hotspot Map of Theft Incidents (σ = 1500)(Figure 4)",
x = "X (meters)",
y = "Y (meters)")
The KDE hotspot patterns indicates that theft incident cluster heavily in major urban districts, confirming the presence of non-random spatial clustering.
# Spatial Weights Matrix
coords_jitter <- jitter(coords, amount = 1)
## KNN = 5
knn5 <- knearneigh(coords_jitter, k = 5)
nb_knn5 <- knn2nb(knn5)
lw_knn5 <- nb2listw(nb_knn5, style = "W")
plot(nb_knn5, coords_jitter, col = "#FFBF00")
points(coords_jitter, pch = 20, cex = 0.5)
title("KNN Spatial Weights (k = 5)(Figure 5)")
## KNN = 8
knn8 <- knearneigh(coords_jitter, k = 8)
nb_knn8 <- knn2nb(knn8)
lw_knn8 <- nb2listw(nb_knn8, style = "W")
plot(nb_knn8, coords_jitter, col = "#134686")
points(coords_jitter, pch = 20, cex = 0.5)
title("KNN Spatial Weights (k = 8)(Figure 6)")
data.frame(
Min_Neighbors = min(card(nb_knn8)),
Max_Neighbors = max(card(nb_knn8)),
Mean_Neighbors = mean(card(nb_knn8))
)
## Min_Neighbors Max_Neighbors Mean_Neighbors
## 1 8 8 8
Perform K-Nearest Neighbors spatial weight matrix to capture interaction between theft location. No isolated observations and a uniform neighbor distribution which ensure numerical stability for subsequent spatial autocorrelation test. Dense regions show tight connection, while space regions exhibit longer distance.
## Global Moran's I using KDE values at event locations
nn <- get.knnx(kde_df[,1:2], coords_jitter, k = 1)
kde_value <- kde_df$value[nn$nn.index]
moran_knn5 <- moran.test(kde_value, lw_knn5)
moran_knn8 <- moran.test(kde_value, lw_knn8)
moran_knn5
##
## Moran I test under randomisation
##
## data: kde_value
## weights: lw_knn5
##
## Moran I statistic standard deviate = 228.14, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 9.979903e-01 -5.542623e-05 1.913859e-05
moran_knn8
##
## Moran I test under randomisation
##
## data: kde_value
## weights: lw_knn8
##
## Moran I statistic standard deviate = 286.57, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 9.973059e-01 -5.542623e-05 1.211281e-05
Global Moran’s I shows extremely strong positive spatial autocorrelation (I ≈ 0.99, p < 0.001) in the KDE-estimated theft intensity surface. Confirming that theft intensity is highly spatially clustered across Taoyuan City. The value for both KNN= 5 and KNN=8 indicated that this clustering pattern is robust to different spatial weight specification.
# LISA
lisa <- localmoran(kde_value, lw_knn5)
lisa_df <- cbind(
as.data.frame(coords_jitter),
Ii = lisa[,1],
p = lisa[,5])
theft_sf$Local_I <- lisa[,1]
theft_sf$p_value <- lisa[,5]
# Define category
alpha <- 0.05
theft_sf$cluster <- "Not Significant"
theft_sf$cluster[theft_sf$Local_I > 0 & kde_value > mean(kde_value) & theft_sf$p_value < alpha] <- "High-High"
theft_sf$cluster[theft_sf$Local_I > 0 & kde_value < mean(kde_value) & theft_sf$p_value < alpha] <- "Low-Low"
theft_sf$cluster[theft_sf$Local_I < 0 & kde_value > mean(kde_value) & theft_sf$p_value < alpha] <- "High-Low"
theft_sf$cluster[theft_sf$Local_I < 0 & kde_value < mean(kde_value) & theft_sf$p_value < alpha] <- "Low-High"
theft_sf$cluster <- factor(
theft_sf$cluster,
levels = c("High-High", "Low-Low", "High-Low", "Low-High", "Not Significant"))
tmap_mode("plot")
## tmap mode set to plotting
tm_shape(theft_sf) +
tm_dots(
col = "cluster",
palette = c(
"High-High" = "#EA2F14",
"Low-Low" = "#FB9E3A",
"High-Low" = "#E6521F",
"Low-High" = "#FCEF91",
"Not Significant" = "#EBE8DB"
),
size = 0.1,
border.col = NA
) +
tm_layout(
title = "LISA Cluster Map of Theft KDE (KNN = 5)(Figure 7)",
title.position = c("center", "top"),
legend.outside = TRUE)
The Lisa cluster map revealed two High-High hotspots, indicating areas where theft intensity is elevated. Surrounding Low-Low regions highlight districts with consistently low theft activity.
The results of spatial analysis demonstrate theft incidents in Taoyuan City exhibit a clear and statistically significant pattern of spatial clustering. These areas display dense point intensity and prominent high-density surfaces, indicating persistent crime concentration rather than random spatial occurrence.
# Population data
xml <- read_xml("/Users/ninalin/Desktop/dsba/winter semester 25/spatial econometrics/term project/popu.xml")
nodes <- xml_find_all(xml, ".//桃園市常住人口數及人口密度")
extract_block <- function(node) {
tibble(
district = xml_text(xml_find_first(node, "./按鄉鎮市區別分_By_Township_City_District")),
pop_total = xml_text(xml_find_first(node, "./常住人口數_總計_人_Number_of_resident_population_Grand_total_person")),
pop_male = xml_text(xml_find_first(node, "./常住人口數_男_人_Number_of_resident_population_Male_person")),
pop_female = xml_text(xml_find_first(node, "./常住人口數_女_人_Number_of_resident_population_Female_person")),
area_km2 = xml_text(xml_find_first(node, "./土地面積_平方公里_Total_land_area_km2")),
density = xml_text(xml_find_first(node, "./人口密度_Population_density"))
)}
pop_df <- bind_rows(lapply(nodes, extract_block))
head(pop_df)
## # A tibble: 6 × 6
## district pop_total pop_male pop_female area_km2 density
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 桃園市 "2441064" "1230476" "1210588" "1221" "1999.3"
## 2 Taoyuan City "" "" "" "" ""
## 3 桃園區 "458566" "224189" "234377" "34.8" "13175.4"
## 4 Taoyuan District "" "" "" "" ""
## 5 中壢區 "468641" "231923" "236718" "76.5" "6124.4"
## 6 Zhongli District "" "" "" "" ""
# English format
pop_df2 <- pop_df %>%
mutate(across(everything(), ~ str_squish(.)))
pop_df2 <- pop_df2 %>%
mutate(is_chinese = str_detect(district, "[\u4e00-\u9fff]"))
cols_to_fill <- c("pop_total", "pop_male", "pop_female", "area_km2", "density")
for (col in cols_to_fill) {
pop_df2[[col]] <- ifelse(
!pop_df2$is_chinese & (pop_df2[[col]] == "" | is.na(pop_df2[[col]])),
lag(pop_df2[[col]]),
pop_df2[[col]]
)}
clean_pop <- pop_df2 %>%
filter(!is_chinese) %>%
dplyr::select(-is_chinese)
clean_pop <- clean_pop %>%
mutate(across(all_of(cols_to_fill), ~ as.numeric(.)))
head(clean_pop)
## # A tibble: 6 × 6
## district pop_total pop_male pop_female area_km2 density
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Taoyuan City 2441064 1230476 1210588 1221 1999.
## 2 Taoyuan District 458566 224189 234377 34.8 13175.
## 3 Zhongli District 468641 231923 236718 76.5 6124.
## 4 Daxi District 103101 53457 49644 105. 981.
## 5 Yangmei District 182572 94057 88515 89.1 2048.
## 6 Luzhu District 184664 93769 90895 75.5 2446.
# preprocess
theft_2023 <- theft %>%
filter(Incident.Year == 2023) %>%
mutate(Precinct = trimws(Precinct)) %>%
filter(!Precinct %in% c("Xinzhuang Police Precinct",
"Linkou Police Precinct"))
mapping <- tibble(
precinct = c(
"Bade Police Precinct",
"Daxi Police Precinct",
"Dayuan Police Precinct",
"Guishan Police Precinct",
"Longtan Police Precinct",
"Luzhu Police Precinct",
"Pingzhen Police Precinct",
"Taoyuan Police Precinct",
"Yangmei Police Precinct",
"Zhongli Police Precinct"
),
district = c(
"Bade District",
"Daxi District",
"Dayuan District",
"Guishan District",
"Longtan District",
"Luzhu District",
"Pingzhen District",
"Taoyuan District",
"Yangmei District",
"Zhongli District"
)
)
theft_2023 <- theft_2023 %>%
left_join(mapping, by = c("Precinct" = "precinct"))
setdiff(unique(theft_2023$Precinct), mapping$precinct)
## character(0)
theft_district_2023 <- theft_2023 %>%
group_by(district) %>%
summarise(theft_count = n(), .groups = "drop")
final_data <- clean_pop %>%
left_join(theft_district_2023, by = "district") %>%
mutate(theft_count = ifelse(is.na(theft_count), 0, theft_count)) %>%
filter(district != "Taoyuan City") # remove non-district row
# District Neighbnor list
neighbors_list <- list(
"Bade District" = c("Taoyuan District", "Daxi District", "Guishan District"),
"Daxi District" = c("Bade District", "Zhongli District", "Longtan District"),
"Dayuan District" = c("Luzhu District", "Guanyin District", "Zhongli District", "Pingzhen District"),
"Fuxing District" = c("Daxi District", "Longtan District"),
"Guishan District" = c("Bade District", "Taoyuan District", "Luzhu District"),
"Longtan District" = c("Daxi District", "Zhongli District", "Yangmei District", "Fuxing District", "Pingzhen District"),
"Luzhu District" = c("Guishan District", "Taoyuan District", "Dayuan District"),
"Pingzhen District"= c("Zhongli District", "Longtan District", "Yangmei District", "Dayuan District"),
"Taoyuan District" = c("Bade District", "Guishan District", "Luzhu District"),
"Yangmei District" = c("Pingzhen District", "Zhongli District", "Longtan District"),
"Zhongli District" = c("Pingzhen District", "Longtan District", "Daxi District", "Dayuan District", "Yangmei District"),
"Xinwu District" = c("Guanyin District", "Dayuan District"),
"Guanyin District" = c("Xinwu District", "Dayuan District")
)
district_order <- final_data$district
nb_obj <- lapply(district_order, function(d){
match(neighbors_list[[d]], district_order)
})
class(nb_obj) <- "nb"
attr(nb_obj, "region.id") <- district_order
# Spatial weight matrix
W <- nb2listw(nb_obj, style = "W")
# Moran test
moran.test(final_data$theft_count, listw = W)
##
## Moran I test under randomisation
##
## data: final_data$theft_count
## weights: W
##
## Moran I statistic standard deviate = 0.4974, p-value = 0.3095
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.00342022 -0.08333333 0.03042072
The Global moran test indicate no detectable spatial autocorrelation.
# SAR model
model_sar <- lagsarlm(
theft_count ~ pop_total + density,
data = final_data,
listw = W
)
summary(model_sar)
##
## Call:lagsarlm(formula = theft_count ~ pop_total + density, data = final_data,
## listw = W)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.7827 -17.1705 -4.9493 9.1369 66.1067
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.84164690 19.63605644 -0.1447 0.8849
## pop_total 0.00059282 0.00011234 5.2768 1.314e-07
## density -0.00389647 0.00435451 -0.8948 0.3709
##
## Rho: -0.18018, LR test value: 0.60753, p-value: 0.43572
## Asymptotic standard error: 0.18392
## z-value: -0.97965, p-value: 0.32726
## Wald statistic: 0.95972, p-value: 0.32726
##
## Log likelihood: -61.10698 for lag model
## ML residual variance (sigma squared): 702.3, (sigma: 26.501)
## Number of observations: 13
## Number of parameters estimated: 5
## AIC: 132.21, (AIC for lm: 130.82)
## LM test for residual autocorrelation
## test value: 4.4038, p-value: 0.035858
The result indicate theft levle in one district do not significantly depend on theft levels in neighboring districts(no evidence of spatial spillover). Population size shows a positive association with theft incident, areas with more resident tends to have more thefts. While population density reflecting a negative relation, possibly reflecting social monitoring effects.
# SEM model
model_sem <- errorsarlm(
theft_count ~ pop_total + density,
data = final_data,
listw = W
)
summary(model_sem)
##
## Call:
## errorsarlm(formula = theft_count ~ pop_total + density, data = final_data,
## listw = W)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.6975 -14.5092 -1.5132 13.1696 38.2280
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0902e+01 7.3053e+00 -1.4924 0.13560
## pop_total 5.8971e-04 6.9226e-05 8.5187 < 2e-16
## density -5.5413e-03 2.6296e-03 -2.1073 0.03509
##
## Lambda: -0.9909, LR test value: 6.174, p-value: 0.012964
## Asymptotic standard error: 0.26221
## z-value: -3.7791, p-value: 0.00015741
## Wald statistic: 14.281, p-value: 0.00015741
##
## Log likelihood: -58.32373 for error model
## ML residual variance (sigma squared): 349.09, (sigma: 18.684)
## Number of observations: 13
## Number of parameters estimated: 5
## AIC: 126.65, (AIC for lm: 130.82)
Population has a strongly positive effect on theft while population density exhibits a negative effect, suggest that denser district may benefit from stronger natural surveillance.
The spatial result indicate demographic factors are signicant in explaining district-level theft incidents in Taoyuan City. Population size shows a high correlation with theft counts, suggesting that districts with larger populations face more theft incidents. In contrast, population density is negatively associated with theft, implying that compact urban areas may benefit from stronger policing presence.
This study demonstrate a examination of theftpatterns in Taoyuan City by combining point hotsopts with spatial econometrics model. The Kernel density and LISA reveal clear and statistically significant spatial clustering, with theft incidents heavily concentrated in major urban districts such as Taoyuan, Zhongli, and Pingzhen. These localized hotspots confirming that theft does not occur randomly but gravitates toward dense activity centers.
However, both SAR and SEM models indicate no significant spatial dependence across districts, suggesting that hotspot formation does not spill over administrative boundaries. Instead, theft variation indicates districts with larger populations experience more theft, while higher population density is associated with lower theft levels.
Together, these findings illustrate how spatial analysis operates at multiple scales: point-level methods detect fine-grained clustering, while spatial econometrics clarifies whether such patterns persist at broader administrative levels. In Taoyuan’s case, theft is spatially clustered locally but not structurally interconnected across districts, highlighting the importance of scale in interpreting urban crime dynamics.