Scope of the report

In criminological research using crime datasets and related administrative records, two disclosure problems usually have to be addressed separately: sensitive locations and identifiers that support linkage. This document combines two linked demonstrations that address each in turn:

Part I: how alternative masking methods change a real Chennai crime location choice model when compared with the unmasked baseline
Part II: how consistent HMAC-SHA256 pseudonymization preserves cross-dataset linkage in a Belgian police-style linked-data workflow

Part I addresses spatial masking and analytical distortion. Part II addresses identifier protection and cross-file linkage.

Part I presents an empirical Chennai application based on the output set in outputs/chennai_masking_alternatives/prior_within_1km/. The main specification operationalizes prior spatial familiarity with the boundary-free term prior_crime_within_1km_any and compares masked estimates against the unmasked baseline of that same specification. A reduced specification that drops the prior-location term entirely is retained as an appendix robustness check.

Part II (Sections 1--8 of the pseudonymization chapter) presents the identifier-management side of the same pipeline. Three linked synthetic administrative datasets -- incident register, offender records, victim records -- were structured as Belgian police-style linked files, drawing on systems such as ISLP and related police databases. Direct identifiers such as name, RRN, address, and PV number were replaced using keyed pseudonymization. The section then shows whether the intended joins remain valid, whether quasi-identifier risk can be detected and suppressed through k-anonymity, and how temporal and free-text fields can be handled within the same workflow.

Part I: Chennai Real-Data Spatial Masking Analysis

Part I presents a real-data Chennai masking analysis based on the published snatching location choice study. The underlying script and output files are located in scripts/chennai_alternative_masking_analysis.R and outputs/chennai_masking_alternatives/. Because this is a real-data setting, the correct benchmark is the unmasked baseline model, not a set of known true parameters.

How Part I should be read: The masking methods are compared against the unmasked Chennai baseline of the same specification. Lower RMSE means the masked model remains closer to the original fitted result. In the main Chennai specification, both crime points and offender-home points are masked, crime wards are reassigned after masking, and prior spatial familiarity is measured with prior_crime_within_1km_any rather than a same-ward indicator.

1. Chennai baseline and comparison logic

The Chennai analysis uses the real snatching data and an adapted published-style Model 2 structure used in scripts/chennai_alternative_masking_analysis.R: distance, prior_crime_within_1km_any, area, population, and the ward-level opportunity covariates. This familiarity measure is used instead of the original same-ward prior-crime indicator so that the model is less dependent on arbitrary administrative boundaries once crime and home points are spatially masked. The baseline model is the unmasked conditional logit model of that same specification. Alternative grid and geomasking methods are then compared against that baseline. The summary table below ranks the tested methods by overall deviation from the unmasked result.

This comparison is also relevant from the perspective of spatial units of analysis. A substantial literature on the modifiable areal unit problem shows that statistical relationships can change when phenomena are aggregated into different spatial units, especially when administrative boundaries do not align well with the behavioral processes under study (Fotheringham & Wong, 1991; Openshaw, 1984). In crime analysis specifically, variation across spatial units is often large enough to alter interpretation, which means that masking should be evaluated not only as a privacy intervention but also as a change to the spatial representation through which offender decision-making is measured (Steenbeek & Weisburd, 2016; Weisburd et al., 2012).

The familiarity term is especially sensitive to this issue. A same-ward indicator is convenient when prior offending is recorded by administrative unit, but it can also treat a boundary crossing as a substantive change in offender familiarity even when the masked event remains geographically close to the original location. A boundary-free formulation such as prior_crime_within_1km_any is therefore more robust to small positional shifts and more consistent with arguments in spatial criminology that offenders' awareness spaces and relevant opportunity structures do not necessarily coincide with administrative borders (Bernasco & Nieuwbeerta, 2005; Song et al., 2017).
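The contrast between the two familiarity measures can be sketched in a few lines. This is a minimal Python illustration, not the code in scripts/chennai_alternative_masking_analysis.R: the function names, the ward labels, and the choice to measure distance from a single reference point of the candidate location are assumptions made for the example, and coordinates are assumed to be projected in metres.

```python
import math


def prior_crime_within_radius_any(candidate_xy, prior_crime_xys, radius_m=1000.0):
    """Return 1 if any prior crime lies within radius_m of the candidate point.

    Coordinates are assumed to be in a projected CRS (metres). Using one
    reference point per candidate location is an assumption of this sketch.
    """
    cx, cy = candidate_xy
    for px, py in prior_crime_xys:
        if math.hypot(px - cx, py - cy) <= radius_m:
            return 1
    return 0


def same_ward_indicator(candidate_ward, prior_crime_wards):
    """Return 1 only if a prior crime was recorded in the candidate's ward."""
    return int(candidate_ward in prior_crime_wards)


# A prior crime 120 m from the candidate point, but across a ward boundary
# (ward labels W12 and W13 are hypothetical).
candidate = (1000.0, 1000.0)
priors_xy = [(1120.0, 1000.0)]

same_ward = same_ward_indicator("W12", {"W13"})
within_1km = prior_crime_within_radius_any(candidate, priors_xy)
```

In this toy case the same-ward indicator drops to 0 because of the boundary alone, while the distance-based indicator stays at 1.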

Table 1: Chennai real-data masking methods ranked against the unmasked baseline
Method	Interpretation	RMSE vs baseline	Max |bias|	Worst term	Crime point stays in same ward (%)	Home point stays in same ward (%)
Grid 250m	Best preservation of the unmasked baseline	0.017	0.038	Marriage halls (# 10)	86.5	87.2
Geomask 50-300m	Very close to the unmasked baseline	0.028	0.075	Mosques (# 10)	79.0	79.3
Grid 500m	Usable, but with some extra distortion	0.040	0.124	Any prior crime within 1 km (0,1)	78.7	79.1
Geomask 200-600m	More noticeable distortion	0.084	0.336	Any prior crime within 1 km (0,1)	59.0	59.2
Grid 1000m	More noticeable distortion	0.091	0.224	Mosques (# 10)	58.1	62.6
Geomask 400-1200m	More noticeable distortion	0.114	0.496	Any prior crime within 1 km (0,1)	34.5	32.0

Grid 250m gives the closest agreement with the unmasked Chennai baseline (RMSE 0.017), followed by Geomask 50-300m (0.028) and Grid 500m (0.040). The wider masking settings produce more noticeable distortion (RMSE 0.084 to 0.114), but the main substantive pattern of the model remains recognizable.
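For readers unfamiliar with the two method families, the following Python sketch shows one common way to implement each. It rests on assumptions the report does not confirm: that "Grid 250m" means snapping a point to the centre of a 250 m square cell, and that "Geomask 50-300m" means a donut-style random displacement of between 50 m and 300 m. The function names are hypothetical and coordinates are assumed to be projected in metres.

```python
import math
import random


def grid_mask(x, y, cell_m):
    """Snap a projected point to the centre of its square grid cell."""
    gx = (math.floor(x / cell_m) + 0.5) * cell_m
    gy = (math.floor(y / cell_m) + 0.5) * cell_m
    return gx, gy


def donut_mask(x, y, r_min, r_max, rng):
    """Displace a point in a random direction by a distance in [r_min, r_max].

    Drawing the squared radius uniformly spreads masked points evenly over
    the area of the ring rather than clustering them near the inner edge.
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dist = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
    return x + dist * math.cos(angle), y + dist * math.sin(angle)


rng = random.Random(42)          # fixed seed so the sketch is reproducible
crime_xy = (130.0, 610.0)        # hypothetical projected coordinates

grid_xy = grid_mask(crime_xy[0], crime_xy[1], cell_m=250.0)
masked_xy = donut_mask(crime_xy[0], crime_xy[1], r_min=50.0, r_max=300.0, rng=rng)
shift_m = math.hypot(masked_xy[0] - crime_xy[0], masked_xy[1] - crime_xy[1])
```

Under these assumptions a grid mask moves a point by at most half the cell diagonal, while a donut mask guarantees a minimum displacement, which is one reason the two families can behave differently at similar nominal scales.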

2. Baseline vs best masking method

The next table compares the unmasked baseline directly against the best-performing masking method. It reports odds ratios, p-values, and the percentage change in the odds ratio. This is the most direct way to judge whether masking changes the substantive interpretation of the Chennai model under the boundary-free familiarity specification.

Table 2: Chennai baseline vs best masking method (Grid 250m)
Term	Baseline OR	Baseline p	Best model	Best-model OR	Best-model p	OR difference	% change from baseline	Direction changed	Significance changed
Distance (km)	0.356	0.000	Grid 250m	0.350	0.000	-0.007	-1.832	FALSE	FALSE
Any prior crime within 1 km (0,1)	10.018	0.000	Grid 250m	9.741	0.000	-0.277	-2.767	FALSE	FALSE
Area (km2)	1.075	0.000	Grid 250m	1.092	0.000	0.017	1.576	FALSE	FALSE
Population (# 1000)	1.004	0.208	Grid 250m	1.003	0.403	-0.001	-0.132	FALSE	FALSE
Retail stores (# 10)	1.010	0.358	Grid 250m	1.013	0.261	0.002	0.228	FALSE	FALSE
Transit stations (# 10)	1.039	0.432	Grid 250m	1.031	0.524	-0.008	-0.727	FALSE	FALSE
Mosques (# 10)	1.073	0.450	Grid 250m	1.093	0.341	0.020	1.859	FALSE	FALSE
Temples (# 10)	1.009	0.760	Grid 250m	0.980	0.494	-0.029	-2.916	TRUE	FALSE
Churches (# 10)	1.182	0.000	Grid 250m	1.161	0.000	-0.021	-1.757	FALSE	FALSE
Education institutions (# 10)	1.072	0.000	Grid 250m	1.066	0.000	-0.005	-0.507	FALSE	FALSE
School and college (# 10)	0.989	0.669	Grid 250m	0.994	0.814	0.005	0.495	FALSE	FALSE
Personal care (# 10)	1.066	0.003	Grid 250m	1.052	0.021	-0.015	-1.379	FALSE	FALSE
Hospitals (# 10)	0.999	0.950	Grid 250m	1.000	0.994	0.001	0.113	TRUE	FALSE
Marriage halls (# 10)	1.123	0.004	Grid 250m	1.168	0.000	0.044	3.921	FALSE	FALSE
Jewelleries (# 10)	1.025	0.163	Grid 250m	1.035	0.045	0.010	1.003	FALSE	TRUE
Textiles (# 10)	0.987	0.128	Grid 250m	0.984	0.058	-0.003	-0.331	FALSE	FALSE
Park (# 10)	1.176	0.017	Grid 250m	1.164	0.024	-0.012	-1.018	FALSE	FALSE
Recreation facilities (# 10)	0.957	0.124	Grid 250m	0.984	0.556	0.026	2.743	FALSE	FALSE
Restaurant (# 10)	1.028	0.033	Grid 250m	1.035	0.009	0.006	0.612	FALSE	FALSE
Government office (# 10)	1.058	0.013	Grid 250m	1.062	0.007	0.004	0.375	FALSE	FALSE
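The derived columns of Table 2 can be reproduced from the two odds ratios and p-values of each term. The Python sketch below is an illustration only: the function name is hypothetical, the 0.05 significance threshold is an assumption that is consistent with the flags in the table, and because it starts from the rounded values printed above, its percentage changes can differ from the table in the second or third decimal.

```python
def compare_term(base_or, base_p, masked_or, masked_p, alpha=0.05):
    """Summarise how one term changes between the baseline and a masked model."""
    return {
        "or_difference": masked_or - base_or,
        "pct_change": 100.0 * (masked_or - base_or) / base_or,
        # Direction changes when the odds ratio moves to the other side of 1.
        "direction_changed": (base_or < 1.0) != (masked_or < 1.0),
        # Significance changes when the p-value crosses alpha.
        "significance_changed": (base_p < alpha) != (masked_p < alpha),
    }


# Rounded values copied from Table 2.
temples = compare_term(1.009, 0.760, 0.980, 0.494)
jewelleries = compare_term(1.025, 0.163, 1.035, 0.045)
prior_crime = compare_term(10.018, 0.000, 9.741, 0.000)
```

A direction flag on a term whose odds ratio sits close to 1 in both models, such as Temples, is a much weaker signal than a direction flag on a large, significant effect.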

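The main findings remain stable under the best masking method. The distance effect remains negative and strong (OR 0.356 vs 0.350), and the prior_crime_within_1km_any effect remains strongly positive (OR 10.018 vs 9.741). The two flagged direction changes (Temples and Hospitals) involve odds ratios close to 1 that are non-significant in both models, and the single significance change (Jewelleries, p = 0.163 to p = 0.045) is a shift across the conventional 0.05 threshold rather than a reversal. The core criminological interpretation therefore does not reverse under the best-performing masking specification.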

3. Visual comparison of masking performance

The first figure compares odds ratios across the masking scenarios. The dashed line in each panel marks the unmasked baseline estimate. The second figure ranks the methods by overall coefficient deviation from the baseline.

Figure 1: Chennai real-data odds ratio comparison across masking methods. The dashed reference line in each panel marks the unmasked baseline estimate for that term.

Figure 2: Chennai real-data masking methods ranked by RMSE of coefficient deviation from the unmasked baseline. Lower values indicate better preservation of the original model.

4. Interpretation

For the Chennai real-data application, the practical question is not which masking method is universally best, but which method preserves the original analytical result most closely. In this run, the smaller masking settings perform best. Grid 250m produces the lowest overall deviation from the unmasked model, with Geomask 50-300m also remaining close. The wider geomasking and coarser grid settings introduce more noticeable coefficient drift, especially for the familiarity term when points cross wards more often.

This should therefore be read as a baseline-deviation study. The goal is to test how far the masked model moves away from the original fitted Chennai result once sensitive crime and home locations are transformed. Because the benchmark is the unmasked model rather than true population parameters, the interpretation is straightforward: smaller RMSE means better analytical preservation.
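The deviation summary can be sketched as follows. This Python example assumes that bias is measured on the coefficient (log odds ratio) scale, which is consistent with the reported maximum absolute bias of 0.038 for Marriage halls under Grid 250m, although the exact definition used in the script is not shown in this report. It uses only four terms from Table 2, so its RMSE is not comparable with the 0.017 reported for the full model.

```python
import math


def deviation_summary(baseline_or, masked_or):
    """Return (RMSE, max |bias|, worst term) on the log odds ratio scale."""
    bias = {
        term: math.log(masked_or[term]) - math.log(baseline_or[term])
        for term in baseline_or
    }
    rmse = math.sqrt(sum(b * b for b in bias.values()) / len(bias))
    worst = max(bias, key=lambda term: abs(bias[term]))
    return rmse, abs(bias[worst]), worst


# Rounded odds ratios for a subset of terms, copied from Table 2.
baseline = {
    "Distance (km)": 0.356,
    "Any prior crime within 1 km": 10.018,
    "Temples": 1.009,
    "Marriage halls": 1.123,
}
grid_250 = {
    "Distance (km)": 0.350,
    "Any prior crime within 1 km": 9.741,
    "Temples": 0.980,
    "Marriage halls": 1.168,
}

rmse, max_abs_bias, worst_term = deviation_summary(baseline, grid_250)
```

Working on the log scale keeps deviations comparable across terms whose odds ratios differ by an order of magnitude, such as distance and the familiarity indicator.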

5. Chennai Part I output files

The Chennai analysis itself is run outside this report. The table below lists the output files currently consumed by Part I.

Table 3: Chennai Part I output files loaded by this report
File	Folder	Size (KB)	Description
Baseline Coefficients	prior within 1km	1.4	Baseline coefficient estimates from the unmasked Chennai model
Baseline Compare Table	prior within 1km	1.0	Paper-style baseline odds-ratio comparison file
Baseline Odds Ratios	prior within 1km	1.8	Baseline odds ratios with confidence intervals
Baseline Vs Best Model	prior within 1km	2.1	Reader-friendly baseline vs best-method comparison
Bias Table	prior within 1km	11.5	Coefficient-level bias table across all masking methods
Coefficient Comparison	prior within 1km	462.0	Odds-ratio comparison plot across masking methods
Deviation Table	prior within 1km	14.3	Detailed deviation of each coefficient from baseline
Easy Method Summary Table	prior within 1km	1.0	Plain-language method ranking summary
Model Coefficients	prior within 1km	9.0	All masked and baseline coefficient estimates
Model Odds Ratios	prior within 1km	11.4	All masked and baseline odds ratios
RMSE Bias Comparison	prior within 1km	73.3	RMSE comparison plot across masking methods
RMSE Summary	prior within 1km	0.4	Method ranking by RMSE and max absolute bias
Ward Shift Summary	prior within 1km	0.2	Ward stability summary after masking

6. Appendix Note: Reduced-Specification Robustness Check

As a robustness check, I also estimated a reduced specification that drops the prior-location term entirely and re-runs the masking comparison on that reduced model. This is not used as the main specification, because it removes an important spatial familiarity mechanism. It is retained as an appendix-style check to show that the main conclusions do not depend entirely on the 1 km familiarity adaptation.

Appendix Table A1: Reduced-specification robustness check
Robustness specification	Best masking method	Best RMSE	Crime point stays in same ward (%)	Home point stays in same ward (%)	Note
Reduced model (drop prior-crime term)	Geomask 50-300m	0.015	79.8	80	Used as appendix robustness check because it drops the prior-location mechanism.

In that reduced model, the best masking method was Geomask 50-300m with an RMSE of about 0.015, which is slightly more stable numerically than the main 1 km familiarity model. However, because it drops the prior-location mechanism altogether, it is treated as a robustness check rather than the preferred substantive specification.

Part I is based on the output set in outputs/chennai_masking_alternatives/prior_within_1km/. The dropped-prior specification in outputs/chennai_masking_alternatives/reduced_no_prior/ is included only as an appendix robustness check.

Part II: Pseudonymization, PV-Number Consistency & Cross-Dataset Linking

This section demonstrates how the anonymization algorithm handles personal identifiers -- PV numbers, RRN (Rijksregisternummer), and person attributes -- so that multiple police datasets can still be linked by researchers after anonymization, without exposing identity.

Cross-zone standardisation - a core motivation for algorithmic anonymization: Belgian police data is produced by approximately 187 local police zones, each with its own database infrastructure and extraction workflow. Without an algorithmic approach, zones anonymize data inconsistently: one zone may redact birth year while another retains it; one may suppress nationality while another codes it differently. Aggregated research across zones then carries systematic measurement error that is invisible to the researcher. The consistent keyed pseudonymization demonstrated in this section addresses the identifier side of this problem: provided every zone applies the same secret key and the same identifier normalisation, HMAC(key, RRN) produces the same pseudonym for the same person regardless of which zone extracted the record. Cross-zone identity linkage then becomes reliable and cross-zone statistical comparisons can rest on consistently linked records, with no exchange of personal data between zones required before pseudonymization.

Publicly documented Belgian criminal-justice data environments extend beyond a single police register. They include local police operational systems such as ISLP/ISLP2, mobile and search layers such as FOCUS@GPI and PoliceSearch, national police reference systems centred on the ANG/BNG and the publicly referenced FEEDIS feed environment, justice systems such as JustCase and JustMask, prison systems such as Sidis Suite, and NICC research or forensic infrastructures such as DOT and be.care. The exact internal schemas of these systems are not publicly documented in full, but the linkage problem is clear: the same person, case, or event may reappear across operational silos under different local formatting conventions and in different data modalities.

These environments combine structured operational records, free-text narratives and legal documents, and temporal and geospatial event data. For that reason, the anonymization problem addressed here is not limited to removing direct identifiers from one table. A practical pipeline must standardize identifiers before transformation, apply deterministic pseudonymization to person and case keys, handle text redaction as a separate task, and protect spatial or temporal fields separately where those fields create re-identification risk. The synthetic ISLP/TAS/SRS2-style inputs used below are therefore best understood as a simplified stand-in for a broader multi-system linkage problem rather than as a claim to reproduce every operational database in full.

Core challenge in Belgian police data: Police information is stored across multiple systems such as local operational police registers, offender and victim files, national police reference systems, judicial case-management environments, prison systems, and forensic research infrastructures. Each record may carry a PV number (Proces-Verbaal) or another case identifier, together with person identifiers that allow the same individual to be linked across events and institutional contexts.

When data are prepared for research, removing names is not sufficient. PV numbers, national registry numbers (Rijksregisternummer/RRN), and date of birth can still support re-identification when they are combined.

The method demonstrated here uses consistent keyed pseudonymization so that:

The same person in Dataset A and Dataset B gets the same anonymized ID

Researchers can still link records across time and datasets

De-pseudonymization is only possible with the secret key, held by the data controller

1. Simulated Raw Police Data - Modelled on Real Belgian Police Records (Three Linked Datasets)

The raw synthetic police-style inputs used below were generated once by scripts/generate_synthetic_demo_data.R and are loaded from data_generated/. This separation was used to keep the report focused on anonymization, pseudonymization, linkage, and output interpretation rather than on repeated data creation during each render.

1.1 Helper: generate realistic Belgian identifiers
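The report's own helper lives in scripts/generate_synthetic_demo_data.R and is not reproduced here. As a rough Python illustration of what such a helper can look like, the sketch below generates RRN-shaped strings with a valid modulo-97 check number and PV numbers in the YYYY/GNT/NNNNN layout seen in the previews. The function names, date ranges, and sequence ranges are assumptions made for the example.

```python
import random


def synthetic_rrn(rng):
    """Build a synthetic RRN-style string YY.MM.DD-SSS.CC with a valid check number."""
    year = rng.randint(1950, 2005)
    month = rng.randint(1, 12)
    day = rng.randint(1, 28)            # capped at 28 to avoid invalid dates
    seq = rng.randint(1, 997)           # daily sequence number
    base = f"{year % 100:02d}{month:02d}{day:02d}{seq:03d}"
    # For births from 2000 onward a "2" is prefixed before taking the modulo.
    number = int(("2" if year >= 2000 else "") + base)
    check = 97 - (number % 97)
    return f"{base[0:2]}.{base[2:4]}.{base[4:6]}-{base[6:9]}.{check:02d}"


def rrn_checksum_ok(rrn):
    """Validate the check number, trying both century conventions."""
    digits = rrn.replace(".", "").replace("-", "")
    base, check = digits[:9], int(digits[9:])
    return check in (97 - int(base) % 97, 97 - int("2" + base) % 97)


def synthetic_pv_number(rng, zone="GNT"):
    """Build a PV-style case number such as 2022/GNT/05041."""
    return f"{rng.randint(2021, 2024)}/{zone}/{rng.randint(1, 9999):05d}"


rng = random.Random(2024)               # fixed seed for reproducibility
sample_rrns = [synthetic_rrn(rng) for _ in range(5)]
sample_pvs = [synthetic_pv_number(rng) for _ in range(5)]
```

Generating identifiers with a valid check number matters for testing, because any downstream validation or normalisation step would otherwise reject the synthetic records.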

1.2 Dataset 1 - Crime Incidents (Feiten Register)

The table below shows the structure of the raw synthetic incident register before pseudonymization. The key analytical fields are the PV number, date, offence type, neighbourhood, and injury/outcome variables.

Table 8: Preview of the raw synthetic incident register (first 4 records). All data are fully synthetic.
pv_number	datum	tijd	delict_type	wijk	letsel	status
2022/GNT/05041	2023-08-04	10:15	Property crime	Wondelgem	Geen	Gesloten
2024/GNT/03266	2022-02-18	19:00	Public order / substance	Sint-Amandsberg	Geen	Gesloten
2022/GNT/05326	2022-03-12	23:30	Property crime	Mariakerke	Geen	Gesloten
2022/GNT/09504	2024-07-07	19:45	Public order / substance	Mariakerke	Geen	Doorverwezen

1.3 Dataset 2 - Offender Records (Dader Register)

Some offenders committed multiple crimes (repeat offenders - same person_id in multiple rows).

The next table highlights why deterministic pseudonymization matters: some individuals appear in multiple PV records, so the research version must preserve within-person linkage without exposing identity.

1.4 Dataset 3 - Victim Records (Slachtoffer Register)

Some victims appear in multiple incidents. Critically, some persons are BOTH offender and victim in different cases - a real pattern in interpersonal violence data.

2. The Problem: Naive Anonymization Breaks Linkage

A naive approach strips direct identifiers (name, RRN, address) and replaces each identifier with a new random ID every time. This is the most common mistake in ad-hoc anonymization.

Why this fails: Random replacement generates one code for 2022/GNT/00341 in the offender file and a completely different code for the same PV number in the victim file. Researchers can no longer join on PV number, so the data become useless for cross-file analysis.
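The failure is easy to reproduce. The Python sketch below is an illustration with hypothetical function and field names: it applies a fresh random code to every row and then attempts the join that worked on the raw PV number.

```python
import secrets


def naive_anonymize(records, id_field):
    """Replace the identifier in every row with a fresh random code."""
    return [
        {**row, id_field: "ID-" + secrets.token_hex(6).upper()}
        for row in records
    ]


# The same PV number appears in both files before anonymization.
offenders = [{"pv_number": "2022/GNT/00341", "rol": "Verdachte"}]
victims = [{"pv_number": "2022/GNT/00341", "rol": "Slachtoffer"}]

raw_matches = (
    {row["pv_number"] for row in offenders}
    & {row["pv_number"] for row in victims}
)

anon_offenders = naive_anonymize(offenders, "pv_number")
anon_victims = naive_anonymize(victims, "pv_number")

anon_matches = (
    {row["pv_number"] for row in anon_offenders}
    & {row["pv_number"] for row in anon_victims}
)
```

The raw files share one PV number, while the randomly recoded files share none, so the offender-to-victim join returns no rows.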

3. The Solution: Consistent Keyed Pseudonymization

We use HMAC-SHA256, which can be understood here as a standard keyed hashing method. In practice, it turns an identifier such as an RRN or PV number into a stable pseudonym using a secret key. The same input and the same key always produce the same pseudonym; without the key, the original identifier cannot be read back directly from the output.

This approach is among the techniques recommended by ENISA (2019) and is compatible with the definition of pseudonymization in GDPR Art. 4(5).
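A minimal Python sketch of the keyed step is shown below, using the standard library hmac and hashlib modules. It is not the report's R implementation. The normalisation rule, the demo key, and the truncation to 12 hexadecimal characters are assumptions; the truncation length is inferred from the pseudonym previews in this report, and a shorter suffix raises the chance of collisions in large files.

```python
import hashlib
import hmac


def normalise(identifier):
    """Drop separators and whitespace so formatting variants hash alike."""
    return "".join(ch for ch in identifier.upper() if ch.isalnum())


def pseudonymize(identifier, key, prefix, length=12):
    """Return a stable pseudonym such as PV-XXXXXXXXXXXX for an identifier."""
    digest = hmac.new(
        key, normalise(identifier).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return f"{prefix}-{digest[:length].upper()}"


# Hypothetical demo key. A real key would be generated and held by the
# data controller and never stored alongside the research data.
DEMO_KEY = b"demo-key-not-for-production"

pv_pseudo = pseudonymize("2022/GNT/05041", DEMO_KEY, "PV")
person_a = pseudonymize("86.04.12-234.71", DEMO_KEY, "PRS")
person_b = pseudonymize("86041223471", DEMO_KEY, "PRS")   # same RRN, other format
other_key = pseudonymize("86.04.12-234.71", b"another-key", "PRS")
```

Normalising before hashing is what makes the pseudonym robust to local formatting conventions: the dotted and undotted forms of the same RRN map to the same output, while a different key yields an unrelated pseudonym.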

3.1 Apply consistent pseudonymization to all three datasets

3.2 Preview: what the pseudonymized data looks like

The researcher-facing files shown below retain analytical fields and stable pseudonyms, but remove direct identifiers such as names, RRN, and raw PV numbers.

Table 9: Preview of the pseudonymized offender file (first 5 records)
pv_pseudo	person_pseudo	person_pseudo_univ	geslacht	nationaliteit	burgelijke_staat	rol	gekend_bij_pz
PV-841A5F319D8B	D-F23B9BA25B6E	PRS-F23B9BA25B6E	V	Duits	Samenwonend	Verdachte	Nee
PV-85A517390E8F	D-523BA24A1825	PRS-523BA24A1825	M	Congolees	Samenwonend	Verdachte	Ja
PV-7E79B32DE9FE	D-1895C0DB1C22	PRS-1895C0DB1C22	M	Roemeens	Samenwonend	Verdachte	Nee
PV-01FB2838F77F	D-80C0220A9292	PRS-80C0220A9292	V	Nederlands	Gescheiden	Verdachte	Nee
PV-CE1B202E69F3	D-AC4CC429F7F3	PRS-AC4CC429F7F3	V	Belgisch	Gehuwd	Verdachte	Ja

Table 10: Preview of the pseudonymized victim file (first 5 records)
pv_pseudo	person_pseudo	person_pseudo_univ	geslacht	nationaliteit	burgelijke_staat	rol	relatie_dader
PV-BDD76BBF3F1C	S-F71BABA98A2E	PRS-F71BABA98A2E	M	Duits	Weduwe/Weduwnaar	Slachtoffer	Kennis
PV-C19BF58E43D7	S-ACF088B1C241	PRS-ACF088B1C241	M	Pools	Gehuwd	Slachtoffer	Onbekend
PV-661E861336D7	S-AB252BC3C6E5	PRS-AB252BC3C6E5	M	Frans	Gehuwd	Slachtoffer	Collega
PV-B87EB997C714	S-7DF963006DA7	PRS-7DF963006DA7	M	Duits	Ongehuwd	Slachtoffer	Familielid
PV-7053CD96B401	S-3D6FEB2A6CA2	PRS-3D6FEB2A6CA2	M	Duits	Gescheiden	Slachtoffer	Kennis

4. Demonstrate: Cross-Dataset Linking Still Works

The core claim of consistent keyed pseudonymization is that all analytical joins work identically before and after anonymization. The scorecard below verifies this explicitly.

Linkage integrity scorecard: the intended analytical joins are preserved after pseudonymization in this synthetic demonstration
Join type	Raw data	After pseudonymization	Preserved?
Incidents row count	120	120	✓ 100%
Unique PV numbers (incidents)	120	120	✓ 100%
Unique offender persons (by RRN/pseudonym)	42	42	✓ 100%
Offender → Incident joins (PV number)	85	85	✓ 100%
Offender → Victim joins (same person)	4	4	✓ 100%
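A scorecard of this kind amounts to computing the same counts twice, once on the raw keys and once on the pseudonyms, and comparing them. The Python sketch below illustrates the idea on three toy records; the key, field names, and helper names are hypothetical, and the RRN strings are the synthetic ones that appear in the narrative examples of this report.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"    # hypothetical demo key


def pseudo(value, prefix):
    """Deterministic keyed pseudonym, truncated to 12 hex characters."""
    digest = hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:12].upper()}"


def join_count(left, right, field):
    """Count rows in left whose key also occurs in right."""
    right_keys = {row[field] for row in right}
    return sum(1 for row in left if row[field] in right_keys)


incidents = [{"pv": "2022/GNT/00001"}, {"pv": "2022/GNT/00002"}, {"pv": "2022/GNT/00003"}]
offenders = [
    {"pv": "2022/GNT/00001", "rrn": "86.04.12-234.71"},
    {"pv": "2022/GNT/00002", "rrn": "86.04.12-234.71"},   # repeat offender
    {"pv": "2022/GNT/00003", "rrn": "90.04.04-123.45"},
]

incidents_p = [{"pv": pseudo(row["pv"], "PV")} for row in incidents]
offenders_p = [
    {"pv": pseudo(row["pv"], "PV"), "person": pseudo(row["rrn"], "PRS")}
    for row in offenders
]

# Each entry holds (raw count, count after pseudonymization).
scorecard = {
    "offender_incident_joins": (
        join_count(offenders, incidents, "pv"),
        join_count(offenders_p, incidents_p, "pv"),
    ),
    "unique_offender_persons": (
        len({row["rrn"] for row in offenders}),
        len({row["person"] for row in offenders_p}),
    ),
}
preserved = all(raw == after for raw, after in scorecard.values())
```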

4.1 Link offenders to incident details via PV pseudonym

Table 11: Offenders linked to their incident record via PV pseudonym (first 6 rows)
PV pseudonym	Person pseudonym	Gender	Nationality	Prior PVs	Known to police	Date	Offence type	Neighbourhood	Injury
PV-841A5F319D8B	PRS-F23B9BA25B6E	V	Duits	0	Nee	2021-07-01	Property crime	Muide	Zwaar
PV-85A517390E8F	PRS-523BA24A1825	M	Congolees	1	Ja	2022-07-28	Public order / substance	Sint-Amandsberg	Geen
PV-7E79B32DE9FE	PRS-1895C0DB1C22	M	Roemeens	0	Nee	2022-02-18	Public order / substance	Sint-Amandsberg	Geen
PV-01FB2838F77F	PRS-80C0220A9292	V	Nederlands	0	Nee	2023-11-21	Public order / substance	Bloemekenswijk	Licht
PV-CE1B202E69F3	PRS-AC4CC429F7F3	V	Belgisch	1	Ja	2021-08-31	Property crime	Muide	Geen
PV-B87EB997C714	PRS-4700C7A200C8	M	Congolees	1	Nee	2022-12-23	Public order / substance	Sint-Amandsberg	Licht

4.2 Find the same person across offender AND victim datasets

This is the most sensitive use case: a person who was an offender in one case and a victim in another. Using the universal person pseudonym (person_pseudo_univ), researchers can trace this without ever knowing who the person is.

Table 12: Persons with BOTH offender and victim roles — linked by universal pseudonym without knowing their identity
Person pseudonym	Offender PV	Role (offender)	Victim PV	Role (victim)	Relation to suspect
PRS-64439954BA3B	PV-8CAE09F42E9D	Verdachte	PV-ED87DE7C95D7	Slachtoffer	Onbekend
PRS-D2811785D27F	PV-EF8DB40B5816	Verdachte	PV-24464F55C289	Slachtoffer	Onbekend
PRS-D2811785D27F	PV-EF8DB40B5816	Verdachte	PV-23FFB7E4AD3D	Slachtoffer	Onbekend
PRS-1E4DC75EB6FF	PV-6DFFC4C79251	Verdachte	PV-21358D6B89AD	Slachtoffer	Partner
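The lookup behind a table like this is a join on person_pseudo_univ. The Python sketch below illustrates it with pseudonyms copied from the preview tables in this report; the function name is hypothetical and the record layout is reduced to two fields.

```python
def dual_role_pairs(offenders, victims):
    """Pair each offender record with victim records of the same person."""
    victim_pvs = {}
    for row in victims:
        victim_pvs.setdefault(row["person_pseudo_univ"], []).append(row["pv_pseudo"])
    return [
        (row["person_pseudo_univ"], row["pv_pseudo"], victim_pv)
        for row in offenders
        for victim_pv in victim_pvs.get(row["person_pseudo_univ"], [])
    ]


offenders = [
    {"person_pseudo_univ": "PRS-D2811785D27F", "pv_pseudo": "PV-EF8DB40B5816"},
    {"person_pseudo_univ": "PRS-F23B9BA25B6E", "pv_pseudo": "PV-841A5F319D8B"},
]
victims = [
    {"person_pseudo_univ": "PRS-D2811785D27F", "pv_pseudo": "PV-24464F55C289"},
    {"person_pseudo_univ": "PRS-D2811785D27F", "pv_pseudo": "PV-23FFB7E4AD3D"},
    {"person_pseudo_univ": "PRS-F71BABA98A2E", "pv_pseudo": "PV-BDD76BBF3F1C"},
]

pairs = dual_role_pairs(offenders, victims)
```

One offender record matched against two victim records yields two rows, which is why PRS-D2811785D27F appears twice in Table 12.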

4.3 Track a repeat offender across multiple PV numbers

Table 13: Criminal history of offender PRS-12DC66B62F83 — identity unknown to researcher
Person pseudonym	PV pseudonym	Date	Offence type	Neighbourhood	Gender	Nationality	Prior PVs
PRS-12DC66B62F83	PV-795D706A47D6	2021-05-23	Violence / interpersonal	Muide	M	Belgisch	0
PRS-12DC66B62F83	PV-661E861336D7	2021-08-01	Violence / interpersonal	Sint-Amandsberg	M	Belgisch	3
PRS-12DC66B62F83	PV-CC570D1B8B80	2021-08-02	Violence / interpersonal	Gentbrugge	M	Belgisch	3
PRS-12DC66B62F83	PV-E871AF0588D2	2023-08-19	Violence / interpersonal	Mariakerke	M	Belgisch	1
PRS-12DC66B62F83	PV-7B70B46C910C	2023-09-02	Property crime	Ledeberg	M	Belgisch	1
PRS-12DC66B62F83	PV-A9921AD699D6	2023-11-13	Public order / substance	Gentbrugge	M	Belgisch	3
PRS-12DC66B62F83	PV-9A654E2DCB93	2024-04-04	Property crime	Gentbrugge	M	Belgisch	2

5. Re-identification Risk Analysis

Even after direct identifiers are removed, a combination of quasi-identifiers (age group, nationality, marital status, and gender together) can single out an individual. Sweeney (2002) reports that 87% of the US population could be uniquely identified by ZIP code, date of birth, and sex.

5.1 Uniqueness by quasi-identifier combination

Table 14: Re-identification risk in the raw offender and victim files
Dataset	Records	Unique combos	% unique	Small group (n)	% small group
Offenders (raw)	85	9	30%	39	46%
Victims (raw)	110	9	21%	63	57%

Table 15: Re-identification risk after pseudonymization
Dataset	Records	Unique combos	% unique	Small group (n)	% small group
Offenders (pseudonymized)	85	9	30%	39	46%
Victims (pseudonymized)	110	9	21%	63	57%
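A uniqueness check of this kind groups records by their quasi-identifier combination and counts how many records fall into very small groups. The Python sketch below is one possible definition, not necessarily the one behind the tables above: it counts records in groups of size 1 as unique and records in groups smaller than k as small-group. The function name and the leeftijdsgroep column are hypothetical; geslacht and nationaliteit follow the column names in the previews.

```python
from collections import Counter


def risk_summary(records, quasi_identifiers, k=5):
    """Count records whose quasi-identifier combination is unique or rare."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in records]
    group_size = Counter(keys)
    unique = sum(1 for key in keys if group_size[key] == 1)
    small = sum(1 for key in keys if group_size[key] < k)
    n = len(records)
    return {
        "records": n,
        "unique_records": unique,
        "pct_unique": 100.0 * unique / n,
        "small_group_records": small,
        "pct_small_group": 100.0 * small / n,
    }


# Five records share one combination, one record stands alone.
records = [
    {"geslacht": "M", "nationaliteit": "Belgisch", "leeftijdsgroep": "25-34"}
    for _ in range(5)
] + [{"geslacht": "V", "nationaliteit": "Duits", "leeftijdsgroep": "35-44"}]

summary = risk_summary(records, ["geslacht", "nationaliteit", "leeftijdsgroep"])
```

Because the check reads only the quasi-identifier columns, replacing names or RRNs with pseudonyms leaves its result unchanged.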

5.2 k-Anonymity check per nationality/age group cell

Tables 14 and 15 report identical figures before and after pseudonymization. Pseudonymization removes direct identifiers but leaves the quasi-identifier columns untouched, so on its own it does not reduce uniqueness risk in quasi-identifier combinations. That is why the k-anonymity check below is still needed.

## No rows in ka_check - skipping plot.

5.3 Apply k-anonymity suppression (k >= 5 threshold)

## Offenders - nationality-generalised records: 55 of 85 
## Offenders - age-group-suppressed records:   0 of 85

## Victims - nationality-generalised records: 83 of 110 
## Victims - age-group-suppressed records:   0 of 110
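The two counts above suggest a two-step rule: generalise nationality in cells below the threshold, then suppress age group in any cell that is still too small. The Python sketch below illustrates that logic under stated assumptions; the function name, the "Other" and "Suppressed" labels, and the leeftijdsgroep column are hypothetical, and the report's actual rule may differ.

```python
from collections import Counter


def recode_small_cells(records, qi_fields, target_field, replacement, k=5):
    """Recode target_field in every record whose QI cell has fewer than k rows."""
    cell_size = Counter(tuple(row[f] for f in qi_fields) for row in records)
    out, changed = [], 0
    for row in records:
        if cell_size[tuple(row[f] for f in qi_fields)] < k:
            row = {**row, target_field: replacement}
            changed += 1
        out.append(row)
    return out, changed


qi = ["nationaliteit", "leeftijdsgroep"]
records = (
    [{"nationaliteit": "Belgisch", "leeftijdsgroep": "25-34"}] * 5
    + [{"nationaliteit": "Duits", "leeftijdsgroep": "25-34"}] * 2
    + [{"nationaliteit": "Pools", "leeftijdsgroep": "25-34"}] * 3
)

# Step 1: generalise nationality where the cell is smaller than k.
step1, n_nationality = recode_small_cells(records, qi, "nationaliteit", "Other")
# Step 2: suppress age group only where the cell is still smaller than k.
step2, n_age = recode_small_cells(step1, qi, "leeftijdsgroep", "Suppressed")

smallest_cell = min(Counter(tuple(row[f] for f in qi) for row in step2).values())
```

In this toy file the first step merges the two small nationality cells into one cell of five, so the second step has nothing left to suppress, which mirrors the pattern of the counts reported above.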

6. End-to-End Research Workflow

This is the complete pipeline a researcher receives. They have no access to real identifiers - only pseudonymous IDs and generalised attributes - yet they can perform full longitudinal and cross-dataset analysis.

Figure 11: Offence type by age group derived from the pseudonymized researcher dataset (Part II). This chart demonstrates that cross-variable analysis remains fully possible: offender age groups (from pseudonymized records) are linked to incident crime types via the PV pseudonym, without any direct identifier being present in the data.

Figure 12: Crime type by neighbourhood heatmap derived from the pseudonymized researcher dataset (Part II). Full analytical detail is preserved: no direct identifiers remain, yet cross-variable tabulation across neighbourhoods and offence types is unimpeded.

7. Additional Release Checks

This section keeps only the additional release checks that are not already visible in the linkage tables above. The core point is straightforward: deterministic pseudonymization preserves joins, but a research release still needs decisions about timestamp precision and free-text redaction before it can be shared safely.

7.1 Temporal precision

Exact timestamps can become identifying when they are combined with offence category, neighbourhood, or person-level attributes. For that reason, the practical question is not whether time should be dropped entirely, but how far it should be coarsened before release.

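Table 16: Temporal precision vs. re-identification risk. In this 120-record synthetic file, coarsening timestamps to time-of-day bands alone barely reduces uniqueness (99.2% to 97.5%). Even month + time band leaves 72.5% of combinations unique, so every level tested here remains high risk.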
Precision level	Unique combinations	Total records	% unique	Re-id risk
Exact (date + HH:MM)	119	120	99.2	High
Date + hour of day (24)	118	120	98.3	High
Date + time band (4)	117	120	97.5	High
Week + time band	108	120	90.0	High
Month + time band	87	120	72.5	High
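The precision levels in Table 16 can be expressed as a coarsening function applied before counting distinct values. The Python sketch below is an illustration under assumptions: the four six-hour bands and their labels are hypothetical because the report does not state its band boundaries, the percentage is computed as distinct values over total records to match the layout of the table, and only the timestamp itself is considered. The first four timestamps come from the incident preview; the fifth is added for the example.

```python
from collections import Counter
from datetime import datetime


def time_band(hour):
    """Map an hour to one of four six-hour bands (assumed boundaries)."""
    return ("night", "morning", "afternoon", "evening")[hour // 6]


def coarsen(ts, level):
    """Return the timestamp at the requested precision level."""
    if level == "exact":
        return ts.strftime("%Y-%m-%d %H:%M")
    if level == "date_hour":
        return ts.strftime("%Y-%m-%d %H")
    if level == "date_band":
        return f"{ts:%Y-%m-%d} {time_band(ts.hour)}"
    if level == "week_band":
        iso = ts.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d} {time_band(ts.hour)}"
    if level == "month_band":
        return f"{ts:%Y-%m} {time_band(ts.hour)}"
    raise ValueError(f"unknown level: {level}")


def pct_distinct(timestamps, level):
    """Distinct coarsened values as a percentage of all records."""
    values = Counter(coarsen(ts, level) for ts in timestamps)
    return 100.0 * len(values) / len(timestamps)


timestamps = [
    datetime(2023, 8, 4, 10, 15),
    datetime(2022, 2, 18, 19, 0),
    datetime(2022, 3, 12, 23, 30),
    datetime(2024, 7, 7, 19, 45),
    datetime(2022, 3, 20, 22, 0),    # hypothetical extra record
]

exact_pct = pct_distinct(timestamps, "exact")
month_band_pct = pct_distinct(timestamps, "month_band")
```

With few records spread over several years, even coarse bands leave most values distinct, which is the same pattern Table 16 shows for the 120-record file.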

7.2 Narrative fields

Structured pseudonymization is not enough when police records include narrative text. Names, RRNs, addresses, and times often remain embedded in natural language fields and therefore require a separate redaction step before a research extract can be released.

Table 17: Free-text narrative anonymization: raw police narrative (left) vs. redacted output (right). In production, Dynizer's neuro-symbolic AI is expected to perform entity identification with higher recall than the regex baseline shown here.
Record ID	Raw narrative	Redacted narrative
PV-A1B2	Op 15/03/2022 om 23:45 werd Jan De Smedt, RRN 86.04.12-234.71, wonende Langestraat 42, 9000 Gent, aangetroffen.	Op [DATUM] om [TIJD] werd [NAAM], RRN [RRN], wonende [ADRES], [POSTCODE], aangetroffen.
PV-C3D4	Het voertuig werd bestuurd door Mohamed El Amrani, geboortedatum 04/04/1990, RRN 90.04.04-123.45, uit Merelbeke.	Het voertuig werd bestuurd door [NAAM], geboortedatum [DATUM], RRN [RRN], uit Merelbeke.
PV-E5F6	Slachtoffer Emma Vandenberghe (geb. 12-06-1978), verblijvend te Veldstraat 18, Gent, deed aangifte van diefstal.	[NAAM] (geb. [DATUM]), verblijvend te [ADRES], Gent, deed aangifte van diefstal.
PV-G7H8	Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om 02:30.	Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om [TIJD].

Interpretation: The regex example is only a baseline used to illustrate the transformation logic. In practice, entity detection in free text needs a stronger NLP layer because names, addresses, and times appear in many formats that simple patterns do not capture reliably. For example, the baseline output in Table 17 leaves the municipality names Merelbeke (PV-C3D4) and Gent (PV-E5F6) unredacted.
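A regex baseline of this kind can be sketched in a few lines of Python. The patterns below are assumptions written for this example, not the report's own rules. They cover only the three most regular entity types (RRN, date, time) and deliberately leave names and addresses alone, which shows where pattern matching stops being sufficient.

```python
import re

# Ordered list of (placeholder, pattern). The RRN pattern runs first so that
# later, looser patterns never see its digits.
PATTERNS = [
    ("[RRN]", re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b")),
    ("[DATUM]", re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b")),
    ("[TIJD]", re.compile(r"\b\d{2}:\d{2}\b")),
]


def redact(text):
    """Replace RRNs, dates, and clock times with placeholders."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


# Synthetic narratives copied from Table 17.
narrative_c3d4 = (
    "Het voertuig werd bestuurd door Mohamed El Amrani, geboortedatum "
    "04/04/1990, RRN 90.04.04-123.45, uit Merelbeke."
)
narrative_g7h8 = (
    "Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om 02:30."
)

redacted_c3d4 = redact(narrative_c3d4)
redacted_g7h8 = redact(narrative_g7h8)
```

In the first narrative the date and RRN are replaced but the synthetic name and the municipality pass through untouched, because neither follows a fixed character pattern.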

8. GDPR Compliance, Project Challenges & Research Approach

8.1 Legal Framework: GDPR, LED 2016/680 & Wet op het Politieambt

Table 18: Legal compliance mapping — GDPR, LED 2016/680 & Wet op het Politieambt (WPA)
Mechanism	Identifier addressed	Legal basis (GDPR / LED 2016/680 / WPA)
HMAC-SHA256 pseudonymization	PV number, Person ID, RRN	GDPR Art. 4(5); LED Art. 3(5) -- pseudonymization; key held by controller
Age group (not exact DOB)	Date of birth	GDPR Art. 5(1)(c); LED Art. 4(1)(c); WPA Art. 44/1 -- data minimisation
Nationality retained	Nationality	GDPR Art. 9(1); LED Art. 10 -- special categories in criminal justice context
Address removed	Street, house number	GDPR Art. 5(1)(c); LED Art. 4(1)(c); WPA Art. 44/1 -- minimum necessary data
k-Anonymity suppression	QI combinations	GDPR Art. 89(1); LED Art. 4(1)(e) -- proportionate technical safeguards for research
Universal person pseudonym	Cross-dataset identity	GDPR Rec. 26; LED Rec. 26 -- re-identification must not be reasonably possible
Key held separately	De-pseudonymization	GDPR Art. 25; LED Art. 20; WPA Art. 44/7 -- data protection by design

Key point for the UGent Crime Lab project: The consistent pseudonymization key is the technical core of the anonymization algorithm. It must be:
- Generated once per data release by the data controller (local police zone)
- Stored in an HSM or certified key vault, never in the research environment
- Rotated per researcher batch to prevent cross-batch re-identification
- Audited (GDPR Art. 30 processing register)

Wet op het Politieambt (WPA) -- Belgian Police Act: Belgian police data is additionally governed by the Wet op het Politieambt (WPA, Belgisch Staatsblad 22 December 1992, as amended). Art. 44/1 WPA requires that personal data collected in the performance of police duties be accurate, adequate, relevant and not excessive. Art. 44/3 WPA mandates retention periods proportionate to purpose. Art. 44/7 WPA grants data subjects rights of access and correction. The pseudonymization pipeline directly operationalises the WPA data minimisation requirement: only the minimum attributes needed for scientific analysis are transferred to the researcher; all surplus personal data is removed or transformed before the data leaves the police system.

LED 2016/680 -- EU Law Enforcement Directive: Police data held for law enforcement purposes falls under Directive (EU) 2016/680 (transposed in Belgium by the Law of 30 July 2018), not the GDPR directly. Key distinctions from GDPR: lawfulness of processing derives from national law (LED Art. 8, not GDPR Art. 6); research access requires a documented scientific purpose with minimum-necessary data (LED Art. 4(2)); special categories (ethnic origin, health, criminal history) are subject to LED Art. 10. The compliance table above maps each mechanism to both its GDPR analogue and the corresponding LED / WPA article.

8.2 Constraints Demonstrated in This Report

The report does not demonstrate every operational challenge of a full police-data release system, but it does show three constraints that remain central in practice. First, spatial masking does not affect every criminological analysis in the same way: in Part I the logdistance and prior_crime_within_1km_any terms are more sensitive to point displacement than several ward-level opportunity covariates, so masking has to be validated against the target analysis rather than chosen abstractly. Second, linked files remain usable only when pseudonymization is deterministic, because joins fail if key use or identifier formatting varies across extracts. Third, even after direct identifiers are removed, linked researcher files can still contain rare quasi-identifier combinations, which is why suppression and coarsening remain necessary before release.

8.3 How My Background Is Relevant to These Constraints

The relevance of my background to this project lies less in claiming a complete production solution and more in bringing together the parts already demonstrated here: spatial criminological analysis, reproducible implementation, and structured disclosure-control thinking.

Spatial criminology, scale, and model sensitivity: Work on crime location choice and related spatial criminological questions is directly relevant to the first constraint above. That work also attends to how changing the spatial scale of analysis affects criminological interpretation. This matters because anonymization decisions should be tied to their analytical consequences, not only to abstract privacy principles.

Linked-data handling and pseudonymization logic: Experience with crime data and familiarity with police-style file structures are relevant to the second constraint. A linked release is only useful when identifiers are handled consistently across files and over time. The pseudonymization section of this report was built to show exactly that issue: the technical task is not only to hide names, but to preserve the joins researchers actually need.
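The consistency requirement can be sketched in a few lines of Python using the standard hmac and hashlib modules. This is an illustration under stated assumptions, not the report's pipeline: the keys, the normalization rule, and the RRN-style value are hypothetical, and the identifier is deliberately invalid so that it cannot correspond to a real person.

```python
import hashlib
import hmac


def normalize_identifier(raw: str) -> str:
    """Canonical form: keep only letters and digits, upper-cased."""
    return "".join(ch for ch in raw if ch.isalnum()).upper()


def pseudonymize(raw: str, key: bytes) -> str:
    """Hex HMAC-SHA256 of the normalized identifier under the given key."""
    message = normalize_identifier(raw).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Hypothetical keys; a real key would come from a managed secret store.
KEY_A = b"hypothetical-project-key"
KEY_B = b"hypothetical-other-key"

# The same synthetic, deliberately invalid identifier as it might
# appear in two different extracts.
incident_file_id = "99.99.99-999.99"
offender_file_id = "99999999999"

# Same key and normalized input: pseudonyms match, so the join works.
join_holds = pseudonymize(incident_file_id, KEY_A) == pseudonymize(offender_file_id, KEY_A)

# Different key per extract: pseudonyms diverge and the join fails.
join_with_other_key = pseudonymize(incident_file_id, KEY_A) == pseudonymize(incident_file_id, KEY_B)

# Same key but no normalization: formatting differences break the join.
join_without_normalization = (
    hmac.new(KEY_A, incident_file_id.encode("utf-8"), hashlib.sha256).hexdigest()
    == hmac.new(KEY_A, offender_file_id.encode("utf-8"), hashlib.sha256).hexdigest()
)

print(join_holds)                  # True
print(join_with_other_key)         # False
print(join_without_normalization)  # False
```

The keyed hash is what keeps the mapping reproducible across files. The normalization step is a separate design choice that has to be fixed once and applied identically to every extract.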

Reproducible implementation and risk auditing: Work in R, sf, simulation, and reproducible workflows is relevant to the third constraint because quasi-identifier risk is not something that should be checked informally. It has to be implemented, measured, documented, and rerun when release conditions change. That is the part of the project where statistical reasoning and implementation work meet most clearly.
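A rerunnable risk audit of this kind can be sketched with the Python standard library alone. The example below is a simplified illustration, not the audit used in this report: the records are synthetic, and the field names age_band, sex, and zone are hypothetical quasi-identifiers. It flags records whose quasi-identifier combination is shared by fewer than k records, then repeats the check after one attribute is dropped.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """For each record, the number of records sharing its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def flag_below_k(records, quasi_identifiers, k):
    """Records whose quasi-identifier combination occurs fewer than k times."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return [r for r, n in zip(records, sizes) if n < k]


# Synthetic records with hypothetical field names.
records = [
    {"age_band": "20-29", "sex": "M", "zone": "A"},
    {"age_band": "20-29", "sex": "M", "zone": "A"},
    {"age_band": "30-39", "sex": "F", "zone": "B"},
    {"age_band": "30-39", "sex": "F", "zone": "B"},
    {"age_band": "70-79", "sex": "F", "zone": "B"},
]

# Audit with all three quasi-identifiers at k = 2.
risky = flag_below_k(records, ["age_band", "sex", "zone"], k=2)

# Re-run the same audit after dropping age_band from the release.
risky_after_dropping_age = flag_below_k(records, ["sex", "zone"], k=2)

print(len(risky))                     # 1
print(len(risky_after_dropping_age))  # 0
```

Because the audit is a pure function of the records, the quasi-identifier list, and k, it can be rerun unchanged whenever the release conditions or the set of released attributes change.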

All names, RRNs, and case details in this section are fully synthetic. No real persons are represented.

References

Academic sources

Bernasco, W., & Nieuwbeerta, P. (2005). How do residential burglars select target areas? A new approach to the analysis of criminal location choice. British Journal of Criminology, 45(3), 296-315. https://doi.org/10.1093/bjc/azi005

European Union Agency for Cybersecurity. (2019). Pseudonymisation techniques and best practices: Recommendations on shaping technology according to data protection and privacy provisions. Publications Office of the European Union.

Fotheringham, A. S., & Wong, D. W. S. (1991). The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A, 23(7), 1025-1044. https://doi.org/10.1068/a231025

Kuralarasan, K., & Bernasco, W. (2022). Location choice of snatching offenders in Chennai City. Journal of Quantitative Criminology, 38(3), 673-696. https://doi.org/10.1007/s10940-021-09514-9

Openshaw, S. (1984). The modifiable areal unit problem. Geo Books.

Song, G., Bernasco, W., Liu, L., Xiao, L., Zhou, S., & Liao, W. (2017). Crime feeds on legal activities: Daily mobility flows help to explain thieves' target location choices. Journal of Quantitative Criminology, 33(4), 831-854. https://doi.org/10.1007/s10940-016-9326-0

Steenbeek, W., & Weisburd, D. (2016). Where the action is in crime? An examination of variability of crime across different spatial units in The Hague, 2001-2009. Journal of Quantitative Criminology, 32(3), 449-469. https://doi.org/10.1007/s10940-015-9276-3

Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570. https://doi.org/10.1142/S0218488502001648

Vandeviver, C., Van Daele, S., & Vander Beken, T. (2015). What makes long crime trips worth undertaking? Balancing costs and benefits in burglars' journey to crime decisions. British Journal of Criminology, 55(2), 399-420. https://doi.org/10.1093/bjc/azu093

Weisburd, D., Groff, E. R., & Yang, S.-M. (2012). The criminology of place: Street segments and our understanding of the crime problem. Oxford University Press.

Legal sources

Belgian Federal Government. (1992). Wet op het Politieambt [Belgian Police Act]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/

Belgian Federal Government. (2018). Wet van 30 juli 2018 betreffende de bescherming van natuurlijke personen met betrekking tot de verwerking van persoonsgegevens [Law of 30 July 2018 on the protection of natural persons with regard to the processing of personal data]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/

European Parliament and Council of the European Union. (2016a). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88. https://eur-lex.europa.eu/eli/reg/2016/679/oj

European Parliament and Council of the European Union. (2016b). Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data. Official Journal of the European Union, L 119, 89-131. https://eur-lex.europa.eu/eli/dir/2016/680/oj

Crime Location Choice, Spatial Anonymization, and Pseudonymization

An empirical Chennai masking application and a Belgian police-style linkage demonstration

Dr. Kuralarasan Kumar

March 2026