Scope of the report

Police-derived crime data always carries two disclosure problems at once: sensitive locations that reveal where crimes happened and where offenders live, and identifiers — PV numbers, RRN, person attributes — that allow the same individual to be traced across files and datasets. In practice, any anonymization pipeline must handle both simultaneously: masking locations without destroying the analytical model, and pseudonymizing identifiers without breaking the cross-file joins that make linked-data research possible.

This document demonstrates that both problems can be addressed within a reproducible workflow:

Part I shows what happens to analytical results when location data is masked — i.e., does the model still give the same answers after spatial transformation?
Part II shows how person and case identifiers can be transformed so that cross-file joins still work for researchers — i.e., can you still link offender to victim to incident after pseudonymization?
Part III situates the demonstrated techniques within the applicable regulatory framework (GDPR, LED 2016/680, Wet op het Politieambt), identifies the principal methodological constraints that any operational police-data release must address, and maps the researcher's competencies onto those constraints.

Together they form the analytical and technical foundation of a pipeline that is both privacy-protective and analytically valid.

1. Chennai Spatial Masking Analysis

Part I applies six spatial masking methods (three grid aggregation levels and three geomasking radius ranges) to a real street robbery dataset from Chennai, India. Each method shifts or aggregates crime incident points and offender home locations to reduce location disclosure risk. The masked datasets are then used to re-estimate the crime location choice model, and the results are compared against the unmasked baseline. The central question is how much the model coefficients change: a small deviation means the masked data can still support the same analytical conclusions, while a large deviation means the transformation has distorted the results beyond practical use.
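As a rough illustration of the two masking families, the sketch below snaps a point to the centre of its grid cell and displaces a point by a random distance between a minimum and a maximum radius (often called donut geomasking). It assumes projected coordinates in metres. The function names and the uniform-over-area sampling are illustrative assumptions, not the procedure used to produce the Chennai results.

```python
import math
import random


def grid_mask(x, y, cell_size):
    """Snap a projected point (metres) to the centre of its grid cell."""
    col = math.floor(x / cell_size)
    row = math.floor(y / cell_size)
    return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)


def donut_geomask(x, y, r_min, r_max, rng=random):
    """Move a point by a random distance in [r_min, r_max] at a random bearing."""
    # Sampling the squared radius uniformly spreads points evenly over the ring's area.
    distance = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    return (x + distance * math.cos(bearing), y + distance * math.sin(bearing))


rng = random.Random(42)          # fixed seed only so the demo is repeatable
original = (1234.0, 5678.0)      # hypothetical projected coordinates
print(grid_mask(*original, cell_size=250))
print(donut_geomask(*original, r_min=50, r_max=300, rng=rng))
```

Grid masking is deterministic, so every point in a cell collapses to the same centre. Geomasking is random, so a released file should come from a single masking run rather than repeated draws that could be averaged.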

1.1. Chennai baseline and comparison logic

The baseline follows the Model 2 specification from the published Chennai snatching study (Kuralarasan & Bernasco, 2022, Table 3), with one adaptation: the same-ward prior-crime indicator is replaced by a boundary-free 1 km familiarity indicator. This change makes the model less sensitive to arbitrary administrative boundaries once crime and home points are spatially masked. Table 1.1 below ranks the tested methods by overall deviation from the unmasked result.

This comparison is also relevant from the perspective of spatial units of analysis. A substantial literature on the modifiable areal unit problem shows that model coefficients can change when phenomena are aggregated into different spatial units, especially when administrative boundaries do not align well with the behavioral processes under study (Fotheringham & Wong, 1991; Openshaw, 1984). In spatial analysis specifically, variation across spatial units is often large enough to alter interpretation (Steenbeek & Weisburd, 2016; Weisburd et al., 2012). For that reason, spatial masking should be evaluated not only as a privacy intervention but also as a change in the spatial representation through which offender decision-making is measured.

The prior-crime familiarity indicator is especially sensitive to administrative boundary dependence. A same-ward indicator is administratively convenient, but it registers a boundary crossing as a meaningful change in offender familiarity even when the masked point remains geographically close to its original position. A boundary-free 1 km indicator is therefore more robust to small positional shifts and more consistent with arguments in spatial criminology that offenders' awareness spaces and relevant opportunity structures do not necessarily coincide with administrative borders (Bernasco & Nieuwbeerta, 2005; Song et al., 2019).
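The contrast between the two indicators can be sketched with a toy example. The code below assumes projected coordinates in metres, a candidate location represented by a single point, and a hypothetical ward boundary at x = 0. None of these details come from the Chennai model itself.

```python
import math


def same_ward_indicator(candidate_ward, prior_crime_wards):
    """1 if the candidate ward contains any of the offender's prior crimes."""
    return int(candidate_ward in prior_crime_wards)


def within_radius_indicator(candidate_xy, prior_crime_xy, radius_m=1000.0):
    """1 if any prior crime lies within radius_m of the candidate point."""
    cx, cy = candidate_xy
    return int(any(math.hypot(px - cx, py - cy) <= radius_m
                   for px, py in prior_crime_xy))


def ward_of(x):
    """Hypothetical layout: ward A west of x = 0, ward B east of it."""
    return "A" if x < 0 else "B"


candidate = (300.0, 0.0)        # candidate location in ward B
original_prior = (-60.0, 0.0)   # prior crime just inside ward A
masked_prior = (90.0, 0.0)      # same crime after a 150 m masking shift

for label, prior in [("original", original_prior), ("masked", masked_prior)]:
    ward_flag = same_ward_indicator(ward_of(candidate[0]), {ward_of(prior[0])})
    radius_flag = within_radius_indicator(candidate, [prior])
    print(label, ward_flag, radius_flag)
```

In this toy case the 150 m shift flips the same-ward indicator while the 1 km indicator is unchanged, which is the boundary sensitivity described above.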

Table 1.1: Masking methods vs baseline
Method	RMSE	Max abs. bias	Most affected covariate	Crime point in same ward (%)	Home point in same ward (%)
Grid 250m	0.017	0.038	Marriage halls (per 10)	86.5	87.2
Geomask 50-300m	0.028	0.075	Mosques (per 10)	79.0	79.3
Grid 500m	0.040	0.124	Any prior crime within 1 km (0,1)	78.7	79.1
Geomask 200-600m	0.084	0.336	Any prior crime within 1 km (0,1)	59.0	59.2
Grid 1000m	0.091	0.224	Mosques (per 10)	58.1	62.6
Geomask 400-1200m	0.114	0.496	Any prior crime within 1 km (0,1)	34.5	32.0

Table 1.1 is a joint summary of overall model drift and spatial reassignment stability. RMSE reports the root mean square error of coefficient deviation from the unmasked model, while Max abs. bias shows the single largest coefficient shift within each masking method. The Most affected covariate column identifies which coefficient is most distorted, and the two ward-stability columns show how often masked crime points and home points remain in their original wards after masking — together, these columns reveal not only which method ranks best overall, but also why some methods degrade more quickly than others.
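A minimal sketch of the two summary statistics follows. It assumes deviations are measured on the log-odds (coefficient) scale and uses three rounded odds ratios from Table 1.2 as input. Because only three terms are included, the resulting RMSE is not the value reported in Table 1.1.

```python
import math


def deviation_summary(baseline_or, masked_or):
    """Return (RMSE, max abs. bias, worst term) for log-odds coefficient shifts."""
    diffs = {
        term: math.log(masked_or[term]) - math.log(baseline_or[term])
        for term in baseline_or
    }
    rmse = math.sqrt(sum(d * d for d in diffs.values()) / len(diffs))
    worst = max(diffs, key=lambda term: abs(diffs[term]))
    return rmse, abs(diffs[worst]), worst


# Rounded odds ratios for three terms (baseline vs Grid 250m).
baseline = {
    "Distance (km)": 0.356,
    "Any prior crime within 1 km (0,1)": 10.018,
    "Marriage halls (per 10)": 1.123,
}
masked = {
    "Distance (km)": 0.350,
    "Any prior crime within 1 km (0,1)": 9.741,
    "Marriage halls (per 10)": 1.168,
}

rmse, max_bias, worst_term = deviation_summary(baseline, masked)
print(round(rmse, 3), round(max_bias, 3), worst_term)
```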

The best-performing masking method in this analysis is Grid 250m. It has the lowest RMSE (0.017), the smallest maximum single-term bias (0.038), and the highest ward stability for both crime points and home points (about 86%–87%). The second-ranked method, Geomask 50-300m, remains relatively close to the baseline, but already shows lower ward stability (about 79%) and a larger maximum bias (0.075). The next masking level, Grid 500m, still preserves the broad substantive pattern, but it is the first method where the boundary-free familiarity term becomes the worst-affected coefficient, indicating that the model starts to feel the effects of more frequent spatial reassignment. The wider masking settings show the same pattern more clearly: once ward stability drops to roughly 60% and below, coefficient distortion increases substantially even though the general direction of the main findings remains recognizable.

1.2. Baseline vs best masking method

Table 1.2 compares the unmasked baseline directly against the best-performing masking method. It reports odds ratios, p-values, and the percentage change in the odds ratio. This is the most direct way to judge whether masking changes the substantive interpretation of the Chennai model under the boundary-free familiarity specification.

Table 1.2: Baseline vs best masking method (Grid 250m)
Term	Baseline OR	Baseline p	Masked OR	Masked p	OR diff.	% OR change
Distance (km)	0.356	0.000	0.350	0.000	-0.007	-1.832
Any prior crime within 1 km (0,1)	10.018	0.000	9.741	0.000	-0.277	-2.767
Area (km2)	1.075	0.000	1.092	0.000	0.017	1.576
Population (per 1,000)	1.004	0.208	1.003	0.403	-0.001	-0.132
Retail stores (per 10)	1.010	0.358	1.013	0.261	0.002	0.228
Transit stations (per 10)	1.039	0.432	1.031	0.524	-0.008	-0.727
Mosques (per 10)	1.073	0.450	1.093	0.341	0.020	1.859
Temples (per 10)	1.009	0.760	0.980	0.494	-0.029	-2.916
Churches (per 10)	1.182	0.000	1.161	0.000	-0.021	-1.757
Education institutions (per 10)	1.072	0.000	1.066	0.000	-0.005	-0.507
School and college (per 10)	0.989	0.669	0.994	0.814	0.005	0.495
Personal care (per 10)	1.066	0.003	1.052	0.021	-0.015	-1.379
Hospitals (per 10)	0.999	0.950	1.000	0.994	0.001	0.113
Marriage halls (per 10)	1.123	0.004	1.168	0.000	0.044	3.921
Jewelleries (per 10)	1.025	0.163	1.035	0.045	0.010	1.003
Textiles (per 10)	0.987	0.128	0.984	0.058	-0.003	-0.331
Park (per 10)	1.176	0.017	1.164	0.024	-0.012	-1.018
Recreation facilities (per 10)	0.957	0.124	0.984	0.556	0.026	2.743
Restaurant (per 10)	1.028	0.033	1.035	0.009	0.006	0.612
Government office (per 10)	1.058	0.013	1.062	0.007	0.004	0.375

The main findings remain stable under the best masking method. The distance effect remains negative and strong, and the boundary-free 1 km familiarity effect remains strongly positive. Only one term, Jewelleries, crosses the conventional 0.05 significance threshold (p = 0.163 at baseline, 0.045 after masking), while its odds ratio changes by about 1%. The core criminological interpretation therefore does not reverse under the best-performing masking specification. In the HTML version, the masked p-value column is colour-coded by significance stability relative to the baseline, while the absolute difference and percentage change in the odds ratio are shaded by the size of the deviation.
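The comparison columns of Table 1.2 can be reproduced approximately with the sketch below. The 0.05 threshold for "significance changed" is an assumption, since the colour-coding rule of the HTML version is not specified here. Because the inputs are the rounded table values, recomputed differences can deviate slightly from the printed ones.

```python
def compare_term(baseline_or, baseline_p, masked_or, masked_p, alpha=0.05):
    """Absolute and relative odds-ratio change, plus a significance-stability flag."""
    or_diff = masked_or - baseline_or
    pct_change = 100.0 * or_diff / baseline_or
    # Assumed rule: flag a term when it sits on different sides of alpha.
    significance_changed = (baseline_p < alpha) != (masked_p < alpha)
    return {
        "or_diff": or_diff,
        "pct_change": pct_change,
        "significance_changed": significance_changed,
    }


# Two rows taken from Table 1.2 (rounded values).
jewelleries = compare_term(1.025, 0.163, 1.035, 0.045)
churches = compare_term(1.182, 0.000, 1.161, 0.000)

print(jewelleries["significance_changed"], churches["significance_changed"])
```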

1.3. Coefficient Patterns and Overall Deviation

The two figures below complement Tables 1.1 and 1.2 by showing the same masking results at two different levels of summary. Figure 1.1 is coefficient-specific: it shows how each odds ratio moves across masking methods relative to the unmasked baseline, marked by the dashed line in each panel. This makes it possible to see which terms remain tightly clustered across methods and which terms drift more clearly as masking becomes stronger. Figure 1.2 then collapses that information into a single method-level ranking using RMSE, so it should be interpreted as a compact summary of overall deviation rather than as a substitute for the term-by-term comparison in Table 1.2.

Figure 1.1: Odds ratios by masking method

Figure 1.2: RMSE ranking of masking methods

Taken together, the figures reinforce the pattern already visible in the tables. The smaller masking settings remain much closer to the unmasked baseline, while the wider geomasking and coarser grid methods introduce visibly larger coefficient movement. The coefficient plot also shows that distortion is not evenly distributed across terms: some ward-level opportunity covariates remain comparatively stable, whereas the familiarity term and a smaller number of place-based covariates become more sensitive as spatial reassignment becomes more common. The RMSE ranking in Figure 1.2 is therefore best interpreted as a summary of a broader pattern already visible in Figure 1.1, not as an isolated performance score.

1.4. Robustness Check Without a Prior-Location Covariate

This section reports a reduced model excluding the prior-location covariate. The main Chennai specification retains a boundary-free prior-familiarity covariate based on whether the offender had previously offended within 1 km of the candidate location, because that measure is less sensitive to arbitrary boundary crossings than a same-ward indicator. A reasonable concern, however, is that the overall masking results might partly depend on that modeling choice. To address that concern, I also estimated a reduced specification that drops the prior-location covariate entirely and re-runs the same masking comparison on the reduced model.

This reduced model is not treated as the preferred substantive specification. It removes an important mechanism of spatial familiarity and therefore answers a more limited question: if that mechanism is omitted altogether, do the masking results still show the same broad ranking pattern? Read in that way, the reduced model is a robustness check on the masking comparison, not a replacement for the main model. Table 1.3 presents the results.

Table 1.3: Robustness (reduced model)
Robustness specification	Best masking method	Best RMSE	Crime point stays in same ward (%)	Home point stays in same ward (%)
Reduced model (drop prior-crime term)	Geomask 50-300m	0.015	79.8	80.0

In that reduced model, the best masking method was Geomask 50-300m with an RMSE of about 0.015, which is slightly lower than the best RMSE in the main 1 km familiarity model (0.017 for Grid 250m). That lower RMSE should not be over-interpreted as evidence that the reduced model is substantively better. With one fewer behaviorally important covariate to preserve, the reduced specification is simply easier to reproduce after masking. The important point is that the smaller masking settings still perform best, while the wider masking settings still introduce visibly more distortion. In other words, the broad masking pattern does not disappear when the prior-location covariate is omitted, but the main specification remains preferable because it preserves a more meaningful behavioural mechanism.

1.5. Part I Summary

Part I is a baseline-deviation study: it measures how far the masked model moves away from the original fitted result, rather than attempting to recover true population parameters. The main result is that smaller masking settings preserve the original model coefficients most effectively: Grid 250m performs best, Geomask 50-300m remains close, and wider masking settings introduce progressively larger distortion. The reduced-model robustness check supports the same broad conclusion: omitting the prior-location covariate changes which method ranks first (Geomask 50-300m instead of Grid 250m), but it does not overturn the overall ranking pattern in which smaller masking settings are analytically safer. The practical implication is therefore straightforward: spatial masking can remain compatible with the original analytical conclusion, but only within a relatively limited range of spatial displacement or aggregation.

2. Pseudonymization, PV-Number Consistency & Cross-Dataset Linking

Belgian police data spans roughly 196 local police zones and multiple institutional systems — local operational registers (ISLP/ISLP2), national reference systems (ANG/BNG, FEEDIS), justice environments (JustCase, JustMask), prison systems (Sidis Suite), and forensic research infrastructures (DOT, be.care). The same person, case, or event can therefore appear across these separate systems under different local formatting conventions. Without an algorithmic approach, zones anonymize data inconsistently: one zone may redact birth year while another retains it; one may suppress nationality while another codes it differently. Aggregated research then carries systematic measurement error invisible to the researcher, and cross-dataset joins on PV numbers or person identifiers break silently.

Removing names is not sufficient to address this. PV numbers, the national registry number (Rijksregisternummer/RRN), and date of birth together can still support re-identification, and replacing them with a fresh random ID each time destroys the cross-file joins that make linked-data research possible. This section demonstrates that consistent keyed pseudonymization solves both problems simultaneously: applying HMAC(key, RRN) deterministically means the same person always receives the same pseudonym regardless of which zone or system extracted the record, so direct identifiers are removed and cross-dataset linkage is preserved. De-pseudonymization is only possible with the secret key, held by the data controller.

2.1. Three Synthetic Belgian Police-Style Datasets

The three synthetic datasets used below — crime incidents, offender records, and victim records — were generated to resemble Belgian police/crime data. All fields, identifiers, and record structures follow Belgian police data conventions, but no real personal data are used.

2.1.1 Synthetic Incident Register (Dataset 1)

Table 2.1 shows the structure of the synthetic incident register before pseudonymization. It represents recorded criminal incidents registered through PV-based workflows in a Belgian police data structure. The important analytical fields are the PV number, date, offence type, neighbourhood, and injury/outcome variables.

Table 2.1: Synthetic incident preview (first 4)
PV Number	Date	Offence Type	Neighbourhood	Injury
2022/GNT/05041	2023-08-04	Property crime	Wondelgem	Geen
2024/GNT/03266	2022-02-18	Public order / substance	Sint-Amandsberg	Geen
2022/GNT/05326	2022-03-12	Property crime	Mariakerke	Geen
2022/GNT/09504	2024-07-07	Public order / substance	Mariakerke	Geen

Taken together with the person-level files below, this incident register provides the case-level anchor for later linkage checks.

2.1.2 Offender Records — Dader Register (Dataset 2)

Some offenders committed multiple crimes: for these repeat offenders, the same person_id appears in multiple rows.

Table 2.2 is a person-event file: the same person can recur across multiple PV records, making linkage preservation a central requirement of the anonymization pipeline.

Table 2.2: Synthetic offender preview (first 4)
Person ID	PV Number	Gender	Nationality	Role
D0057	2021/GNT/03774	V	Duits	Verdachte
D0044	2022/GNT/03783	M	Congolees	Verdachte
D0028	2024/GNT/03266	M	Roemeens	Verdachte
D0041	2024/GNT/09571	V	Nederlands	Verdachte

2.1.3 Victim Records — Slachtoffer Register (Dataset 3)

Some victims appear in multiple incidents. Critically, some persons are both offender and victim in different cases, a real pattern in interpersonal violence data.

Table 2.3 mirrors the offender file from the victim side, showing that the eventual release logic has to preserve not only offender-to-incident joins, but also cross-role person linkage across files.

Table 2.3: Synthetic victim preview (first 4)
Person ID	PV Number	Gender	Nationality	Relation to Suspect
S0016	2024/GNT/05463	M	Duits	Kennis
S0037	2022/GNT/06644	M	Pools	Onbekend
S0061	2023/GNT/08088	M	Frans	Collega
S0066	2024/GNT/03176	M	Duits	Familielid

2.2. Linkage Failure Under Naive Anonymization

A naive approach strips direct identifiers (name, RRN, address) and replaces each identifier with a new random ID every time. This is the most common mistake in ad-hoc anonymization.

Why this fails: Random replacement assigns 2022/GNT/00341 one code in the offender file and a completely different code in the victim file. Researchers can no longer join on PV number, so the data becomes useless for cross-file analysis.

The root cause is non-determinism: because each call to sample() is independent, the same input value produces a different output in each file. The fix is not to add more redaction — it is to replace randomness with a deterministic function: one that always maps the same input to the same output, using a secret key that only the data controller holds. Table 2.2b quantifies this failure using actual join counts.

Table 2.2b: Naive anonymization linkage failure
Join type	Raw data	After naive anonymization	Result
Offender → Victim (shared PV number)	54 shared PV numbers	0 matched rows	✗ Complete linkage failure
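The failure mode in Table 2.2b can be reproduced with a few lines. This sketch uses Python's secrets module in place of the sample() call mentioned above. Apart from 2022/GNT/00341, the PV numbers are made up for illustration.

```python
import secrets


def naive_anonymize(pv_numbers):
    """Replace every PV number with a fresh random token (non-deterministic)."""
    return [secrets.token_hex(12).upper() for _ in pv_numbers]


offender_pvs = ["2022/GNT/00341", "2023/GNT/01822"]
victim_pvs = ["2022/GNT/00341", "2024/GNT/00077"]

# Raw files: the shared PV number supports a join.
raw_shared = set(offender_pvs) & set(victim_pvs)

# Naive release: each file gets independent random codes, so nothing matches.
naive_shared = set(naive_anonymize(offender_pvs)) & set(naive_anonymize(victim_pvs))

print(len(raw_shared), len(naive_shared))
```

With 96-bit random tokens, a chance match between the two files is negligible, so the shared incident becomes unjoinable.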

2.3. Consistent Keyed Pseudonymization as a Solution

I propose to use HMAC-SHA256, a standard keyed hashing method. It turns personal identifiers such as an RRN or PV number into a stable pseudonym using a secret key applied consistently across all files (see §2.2 for why this determinism is the core requirement). The released token is 24 hexadecimal characters (96 bits), shorter than the full 64-character HMAC-SHA256 digest, so the digest is presumably truncated. At this dataset scale the token remains collision-resistant, and it can be neither inverted nor recomputed without the key.
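A minimal sketch of such a keyed pseudonym function, using Python's standard hmac and hashlib modules, is shown below. Keeping the first 24 hexadecimal characters of the digest is an assumption about how the 24-character token is derived. The key and the function name are placeholders, and input normalisation (whitespace, letter case, formatting variants between zones) is not handled.

```python
import hashlib
import hmac


def pseudonymize(identifier, key, prefix, length=24):
    """Derive a stable pseudonym from an identifier with HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    # Assumed truncation: keep the first `length` hex characters of the digest.
    return f"{prefix}-{digest[:length].upper()}"


key = b"demo-key-not-for-production"   # placeholder; a real key stays with the controller

first = pseudonymize("2022/GNT/00341", key, "PV")
second = pseudonymize("2022/GNT/00341", key, "PV")
other_key = pseudonymize("2022/GNT/00341", b"another-key", "PV")

print(first == second, first == other_key)
```

The same input and key always yield the same token, which is what preserves joins across files. A different key yields unrelated tokens, which is why key custody determines who can re-link records.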

A terminological point is important here. Under GDPR Recital 26 and Art. 4(5) (as reiterated in EDPB Guidelines 01/2025 on Pseudonymisation), data is pseudonymized — not anonymized — as long as a key exists that could link the pseudonym back to the original identifier. This pipeline is therefore correctly described as a privacy-preserving transformation, not full anonymization: it replaces direct identifiers with stable pseudonyms while keeping the key under the exclusive control of the data controller. Full anonymization would require the key to be permanently destroyed, which would also permanently destroy the ability to audit or correct the released data. The approach here is deliberately and correctly pseudonymization.

ENISA discusses keyed-hash / HMAC-style approaches as valid pseudonymization techniques (ENISA, 2019), and this implementation is compatible with GDPR Art. 4(5).

The same HMAC key is applied to all three datasets. Each direct identifier — RRN, PV number, name, address — is either pseudonymized or removed. The researcher-facing files retain only analytical fields and stable pseudonyms.

2.3.1 Structure of the Pseudonymized Research Files

The researcher-facing files shown below retain analytical fields and stable pseudonyms, but remove direct identifiers such as names, RRN, and raw PV numbers.

Table 2.4: Pseudonymized offender preview (first 5)
PV Pseudonym	Person Pseudonym	Gender	Nationality	Role
PV-841A5F319D8B609FB41C7E8C	D-F23B9BA25B6EC33EF0F3BD2C	V	Duits	Verdachte
PV-85A517390E8F41928753A119	D-523BA24A18253BDFC7A8C87D	M	Congolees	Verdachte
PV-7E79B32DE9FE421A4CA45E08	D-1895C0DB1C22DB37728E97CE	M	Roemeens	Verdachte
PV-01FB2838F77F4CBDEAD4396E	D-80C0220A9292AD1AE3170D97	V	Nederlands	Verdachte
PV-CE1B202E69F37EAF63F9FD98	D-AC4CC429F7F3D5489E14522B	V	Belgisch	Verdachte

Table 2.5: Pseudonymized victim preview (first 5)
PV Pseudonym	Person Pseudonym	Gender	Nationality	Role
PV-BDD76BBF3F1C5F8909C5DD9D	S-F71BABA98A2E1DA1A8D494C6	M	Duits	Slachtoffer
PV-C19BF58E43D7145B7155580B	S-ACF088B1C241F2BD6C561B79	M	Pools	Slachtoffer
PV-661E861336D7E15104CD9F69	S-AB252BC3C6E5F0B6B00863D2	M	Frans	Slachtoffer
PV-B87EB997C7140CFAA42EE97A	S-7DF963006DA72F97BF4CF224	M	Duits	Slachtoffer
PV-7053CD96B4018C7075F2DC80	S-3D6FEB2A6CA2AB892563603F	M	Duits	Slachtoffer

Tables 2.4 and 2.5 should be interpreted as structure checks: they show that the research files still contain the fields needed for analysis and linkage, but no longer expose direct personal identifiers.

2.4. Cross-Dataset Linkage Integrity Verification

The core claim of consistent keyed pseudonymization is that all analytical joins work identically before and after pseudonymization. Table 2.5b summarises this as a scorecard covering the three join types demonstrated in Sections 2.4.1–2.4.3:

Offender → Incident (Table 2.6): offender-level attributes — pseudonymous person ID, age group, nationality — linked to case-level incident data via the shared PV pseudonym (the pseudonymized case number).
Cross-role person linkage (Table 2.7): the same individual appearing as offender in one case and as victim in another, traced via the universal person pseudonym without revealing their identity.
Repeat offender history (Table 2.8): all case records attributed to one pseudonymous person, assembled in chronological order for longitudinal analysis.

This matters not only for longitudinal or cross-file analysis, but also for network analysis and intelligence-led or forensic insight, because stable person- and case-level pseudonyms allow co-involvement, repeat contacts, and event relationships to be reconstructed without exposing direct identifiers.

Table 2.5b: Linkage integrity scorecard
Join type	Raw data	After pseudonymization	Preserved?
Incidents raw count	120	120	100% (ok)
Unique PV numbers (incidents)	120	120	100% (ok)
Unique offender persons (by RRN/pseudonym)	42	42	100% (ok)
Offender → Incident joins (PV number)	85	85	100% (ok)
Offender → Victim joins (same person)	4	4	100% (ok)

The scorecard should be read by comparing each count before and after pseudonymization. The Raw data column uses real identifiers such as RRN and PV number, whereas the After pseudonymization column uses only the derived pseudonyms. Identical counts show that the intended joins are preserved after direct identifiers are transformed. That result demonstrates linkage integrity, but it does not imply that the release is risk-free: quasi-identifier risk remains and is evaluated in Section 2.5.
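The before/after comparison behind such a scorecard can be sketched as follows. The records, the key, and the helper names are hypothetical. The point is only that the same counting function gives identical results on raw identifiers and on keyed pseudonyms.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"   # placeholder key


def pseudo(value, prefix):
    digest = hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:24].upper()}"


def scorecard(offenders, victims, incidents):
    """Count persons and joins; works on raw IDs and pseudonyms alike."""
    incident_set = set(incidents)
    offender_persons = {person for person, _ in offenders}
    victim_persons = {person for person, _ in victims}
    return {
        "unique_offenders": len(offender_persons),
        "offender_incident_joins": sum(1 for _, pv in offenders if pv in incident_set),
        "cross_role_persons": len(offender_persons & victim_persons),
    }


# Hypothetical toy records: (person identifier, PV number).
offenders = [("rrn-001", "2022/GNT/00341"), ("rrn-002", "2022/GNT/00342"),
             ("rrn-001", "2023/GNT/00100")]
victims = [("rrn-003", "2022/GNT/00341"), ("rrn-002", "2023/GNT/00100")]
incidents = ["2022/GNT/00341", "2022/GNT/00342", "2023/GNT/00100", "2024/GNT/00007"]

raw = scorecard(offenders, victims, incidents)
released = scorecard(
    [(pseudo(p, "PRS"), pseudo(pv, "PV")) for p, pv in offenders],
    [(pseudo(p, "PRS"), pseudo(pv, "PV")) for p, pv in victims],
    [pseudo(pv, "PV") for pv in incidents],
)
print(raw == released, raw)
```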

2.4.1 Offender–Incident Linkage via PV Pseudonym

Table 2.6 shows the simplest preserved join in the release pipeline: offender attributes remain linkable to incident-level case information through the shared PV pseudonym. Two key column labels require clarification: the PV pseudonym is the pseudonymized case identifier (derived from the original PV number and stable across all files); the Person pseudonym is the pseudonymized offender identifier (derived from the RRN via HMAC-SHA256). From this table onward the person pseudonym carries the universal PRS- prefix, and its hexadecimal part matches the D- tokens shown in Table 2.4. Neither field contains any direct personal information: both serve solely as stable, researcher-safe linkage keys.

Table 2.6: Offender–incident link (sample rows)
PV pseudonym	Person pseudonym	Gender	Nationality	Prior PVs	Known to police	Date	Offence type	Neighbourhood	Injury
PV-841A5F319D8B609FB41C7E8C	PRS-F23B9BA25B6EC33EF0F3BD2C	V	Duits	0	Nee	2021-07-01	Property crime	Muide	Zwaar
PV-85A517390E8F41928753A119	PRS-523BA24A18253BDFC7A8C87D	M	Congolees	1	Ja	2022-07-28	Public order / substance	Sint-Amandsberg	Geen
PV-7E79B32DE9FE421A4CA45E08	PRS-1895C0DB1C22DB37728E97CE	M	Roemeens	0	Nee	2022-02-18	Public order / substance	Sint-Amandsberg	Geen
PV-01FB2838F77F4CBDEAD4396E	PRS-80C0220A9292AD1AE3170D97	V	Nederlands	0	Nee	2023-11-21	Public order / substance	Bloemekenswijk	Licht
PV-CE1B202E69F37EAF63F9FD98	PRS-AC4CC429F7F3D5489E14522B	V	Belgisch	1	Ja	2021-08-31	Property crime	Muide	Geen
PV-B87EB997C7140CFAA42EE97A	PRS-4700C7A200C8AB703BE079FA	M	Congolees	1	Nee	2022-12-23	Public order / substance	Sint-Amandsberg	Licht

2.4.2 Cross-Role Person Linkage: Offender and Victim Datasets

As noted in §2.1.3, some persons appear in both the offender and victim registers across different cases — the cross-role pattern characteristic of interpersonal violence data. Table 2.7 confirms that the universal person pseudonym preserves this linkage after pseudonymization: researchers can trace the same individual across roles without ever knowing who that person is.

Table 2.7: Cross-role linkage via universal pseudonym
Person pseudonym	Incident PV No. (offender role)	Role (offender)	Incident PV No. (victim role)	Role (victim)	Relation to suspect
PRS-64439954BA3B5F3509FF73C6	PV-8CAE09F42E9D407D204ED698	Verdachte	PV-ED87DE7C95D7C998D7074D5D	Slachtoffer	Onbekend
PRS-D2811785D27F7368CAFE362B	PV-EF8DB40B581698D0B476B36E	Verdachte	PV-24464F55C28975FB5AA79987	Slachtoffer	Onbekend
PRS-D2811785D27F7368CAFE362B	PV-EF8DB40B581698D0B476B36E	Verdachte	PV-23FFB7E4AD3DF5B710508392	Slachtoffer	Onbekend
PRS-1E4DC75EB6FF9F17A9B1FE74	PV-6DFFC4C792512AC2BEF2DDF5	Verdachte	PV-21358D6B89AD66EE1E0C7406	Slachtoffer	Partner

2.4.3 Repeat Offender Tracking Across Multiple PV Records

Table 2.8 extends the same logic over time: it shows that repeat involvement across multiple PV records can still be reconstructed for one pseudonymous individual, which is essential for longitudinal offending analysis.

Table 2.8: Criminal history of offender PRS-12DC66B62F83CA4CE2DDE998 — linked via universal pseudonym
Person pseudonym	PV pseudonym	Date	Offence type	Neighbourhood	Gender	Nationality	Prior PVs
PRS-12DC66B62F83CA4CE2DDE998	PV-795D706A47D6B9B8313E718D	2021-05-23	Violence / interpersonal	Muide	M	Belgisch	0
PRS-12DC66B62F83CA4CE2DDE998	PV-661E861336D7E15104CD9F69	2021-08-01	Violence / interpersonal	Sint-Amandsberg	M	Belgisch	3
PRS-12DC66B62F83CA4CE2DDE998	PV-CC570D1B8B80400F821D559F	2021-08-02	Violence / interpersonal	Gentbrugge	M	Belgisch	3
PRS-12DC66B62F83CA4CE2DDE998	PV-E871AF0588D26888F08E6816	2023-08-19	Violence / interpersonal	Mariakerke	M	Belgisch	1
PRS-12DC66B62F83CA4CE2DDE998	PV-7B70B46C910CA605295A4B82	2023-09-02	Property crime	Ledeberg	M	Belgisch	1
PRS-12DC66B62F83CA4CE2DDE998	PV-A9921AD699D6DCB60A9DA459	2023-11-13	Public order / substance	Gentbrugge	M	Belgisch	3
PRS-12DC66B62F83CA4CE2DDE998	PV-9A654E2DCB93B08E64436258	2024-04-04	Property crime	Gentbrugge	M	Belgisch	2
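Assembling such a history is a simple group-and-sort once pseudonyms are stable. The sketch below uses short made-up pseudonyms and dates rather than the tokens in Table 2.8, and the function name is illustrative.

```python
from collections import defaultdict
from datetime import date


def offender_histories(rows):
    """Group (person pseudonym, PV pseudonym, date) rows by person, oldest first."""
    grouped = defaultdict(list)
    for person, pv, when in rows:
        grouped[person].append((when, pv))
    return {person: sorted(events) for person, events in grouped.items()}


# Hypothetical pseudonymized rows, deliberately out of date order.
rows = [
    ("PRS-AAA", "PV-003", date(2023, 8, 19)),
    ("PRS-BBB", "PV-002", date(2022, 1, 5)),
    ("PRS-AAA", "PV-001", date(2021, 5, 23)),
]

histories = offender_histories(rows)
# Repeat offenders are persons with more than one linked PV record.
repeaters = {person: events for person, events in histories.items() if len(events) > 1}
print(repeaters)
```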

2.5. Re-identification Risk Analysis

Even after direct identifiers are removed, a combination of quasi-identifiers (age group, nationality, marital status, and gender) can make an individual unique and therefore re-identifiable. Sweeney (2000) estimated that 87% of Americans are uniquely identified by ZIP code, date of birth, and sex.

2.5.1 Uniqueness by quasi-identifier combination

Table 2.9 reports quasi-identifier uniqueness in the raw files — before any pseudonymization — to establish a risk baseline. Table 2.10 reports the same metric in the pseudonymized files. The values are expected to be identical across both tables: pseudonymization replaces direct identifiers (names, RRN, PV numbers) but leaves quasi-identifiers — age group, nationality, marital status, and gender — unchanged. The comparison therefore confirms that pseudonymization alone does not eliminate re-identification risk through quasi-identifier combinations, which is why the k-anonymity suppression step in Section 2.5.3 remains necessary.

Table 2.9: Quasi-identifier risk (raw files)
Dataset	Records	Distinct combos	Unique records (n)	% unique	Small group (n)	% small group
Offenders (raw)	85	32	10	12%	44	52%
Victims (raw)	110	46	12	11%	74	67%

Table 2.10: Quasi-identifier risk (after pseudonymization)
Dataset	Records	Distinct combos	Unique records (n)	% unique	Small group (n)	% small group
Offenders (pseudonymized)	85	32	10	12%	44	52%
Victims (pseudonymized)	110	46	12	11%	74	67%

Tables 2.9 and 2.10 should be read comparatively. The percentages are record-level shares, not shares of distinct combination types. The identical before/after figures confirm what the section introduction above explains: pseudonymization does not alter the quasi-identifiers used in the risk calculation.
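The record-level shares can be computed as in the sketch below. It assumes that "small group" means a quasi-identifier combination shared by fewer than k = 5 records, with unique records included. That threshold, the field names, and the toy records are assumptions.

```python
from collections import Counter


def qi_risk(records, quasi_identifiers, k=5):
    """Record-level uniqueness and small-group shares for a set of quasi-identifiers."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    total = len(records)
    unique_n = sum(1 for n in combos.values() if n == 1)
    # Assumed definition: records whose combination occurs fewer than k times.
    small_n = sum(n for n in combos.values() if n < k)
    return {
        "records": total,
        "distinct_combos": len(combos),
        "unique_n": unique_n,
        "pct_unique": 100.0 * unique_n / total,
        "small_group_n": small_n,
        "pct_small_group": 100.0 * small_n / total,
    }


# Hypothetical toy file: 5 + 2 + 1 records across three combinations.
records = (
    [{"age_group": "18-24", "nationality": "Belgisch", "gender": "M"}] * 5
    + [{"age_group": "25-34", "nationality": "Duits", "gender": "V"}] * 2
    + [{"age_group": "65+", "nationality": "Pools", "gender": "M"}]
)

risk = qi_risk(records, ["age_group", "nationality", "gender"])
print(risk)
```

Because the function never touches person or case identifiers, running it on raw and pseudonymized files gives the same result, which is the pattern seen in Tables 2.9 and 2.10.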

2.5.2 k-Anonymity check per nationality/age group cell

That residual uniqueness risk is why the k-anonymity check below is still needed. Figure 2.1 visualises the cell-level risk distribution by age group, making it possible to see at a glance which age groups are most exposed before suppression is applied.

Figure 2.1: k-anonymity risk by age group

2.5.3 k-Anonymity Suppression (k >= 5 Threshold)

k-Anonymity requires that every record in a release file shares its quasi-identifier combination with at least k-1 other records, making individual re-identification at most a 1-in-k probability (Sweeney, 2002). To reduce the privacy risk created by raw country labels while retaining interpretability, nationality is released in grouped categories rather than being suppressed immediately. The main release scheme is Belgian / EU (non-Belgian) / Non-EU, with a fallback to Belgian / Non-Belgian when the 3-group cell still falls below the chosen threshold in this small synthetic sample. Table 2.11 defines the release categories, their source label coverage, and the privacy and analytical rationale for each.

Table 2.11: Nationality release grouping (recommended)
Release category	Example source labels	Privacy rationale	Analytical usefulness	Recommended for small police datasets
Belgian	Belgisch	Keeps the domestic reference category while removing country-level specificity.	Preserves the key contrast between Belgian and non-Belgian records.	Yes
EU (non-Belgian)	Nederlands, Frans, Duits, Italiaans, Pools, Portugees, Roemeens, Spaans	Collapses several country labels into a broader region, reducing uniqueness from rare EU nationalities.	Retains a meaningful European mobility category without exposing exact country labels.	Yes
Non-EU	Congolees, Marokkaans, Turks	Absorbs the highest-risk rare-country labels into one broad release category.	Preserves a coarse but policy-relevant distinction for descriptive analysis.	Yes
Fallback: Non-Belgian	Applied when the 3-group cell is still below k	Further reduces small-cell risk in very small police datasets.	Provides a pragmatic fallback when the 3-group scheme remains too sparse.	Yes
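The grouping and fallback logic can be sketched as follows. The label sets follow Table 2.11. Defining a cell as nationality group by age group, treating every unlisted label as Non-EU, and leaving Belgian records unchanged under the fallback are assumptions made for illustration.

```python
from collections import Counter

EU_LABELS = {"Nederlands", "Frans", "Duits", "Italiaans",
             "Pools", "Portugees", "Roemeens", "Spaans"}


def three_group(label):
    """Map a raw nationality label to the 3-group release scheme."""
    if label == "Belgisch":
        return "Belgian"
    # Assumption: any label outside the EU list is released as Non-EU.
    return "EU (non-Belgian)" if label in EU_LABELS else "Non-EU"


def release_nationality(records, k=5):
    """records: list of (nationality label, age group). Returns released pairs."""
    cells = Counter((three_group(nat), age) for nat, age in records)
    released = []
    for nat, age in records:
        group = three_group(nat)
        # Fallback to the 2-group scheme when the 3-group cell is below k.
        if cells[(group, age)] < k and group != "Belgian":
            group = "Non-Belgian"
        released.append((group, age))
    return released


# Hypothetical toy file, single age group.
records = ([("Belgisch", "18-24")] * 5 + [("Duits", "18-24")] * 3
           + [("Pools", "18-24")] * 2 + [("Turks", "18-24")])
released_counts = Counter(group for group, _ in release_nationality(records))
print(released_counts)
```

A fuller version would re-check the cell sizes after the fallback, since the merged Non-Belgian cell can itself still fall below k.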

Table 2.12: Release summary after suppression
Dataset	Released in 3 groups (n)	% recoded to 3 groups	Fallback to 2 groups (n)	% fallback 2-group	Age group suppressed (n)	% age suppressed
Offenders	54	63.5	31	36.5	3	3.5
Victims	64	58.2	46	41.8	1	0.9

Table 2.12 quantifies the practical cost of the final release logic. In small police datasets, grouped nationality categories often retain more analytical meaning than blanket suppression, but some records still need a fallback to Belgian versus non-Belgian and some age groups still need suppression when the corresponding age-by-gender cell remains below the threshold.
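The age suppression step can be illustrated in the same spirit. The sketch assumes that suppression replaces the age group with a missing value whenever the age-by-gender cell holds fewer than k = 5 records. The toy data and the function name are hypothetical.

```python
from collections import Counter


def suppress_age(records, k=5):
    """records: list of (age group, gender). Blank the age group in cells below k."""
    cells = Counter(records)
    return [
        (age if cells[(age, gender)] >= k else None, gender)
        for age, gender in records
    ]


# Hypothetical toy file: one well-populated cell and one singleton cell.
records = [("18-24", "M")] * 5 + [("65+", "V")]
released = suppress_age(records)

suppressed_n = sum(1 for age, _ in released if age is None)
print(suppressed_n, 100.0 * suppressed_n / len(released))
```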

2.6. End-to-End Research Workflow

This section shows the complete output a researcher receives from the pipeline. Researchers have no access to real identifiers, only pseudonymous IDs and generalised attributes, yet they can still perform full longitudinal and cross-dataset analysis.

Figure 2.2 shows the two concrete transformations the pipeline applies to offender records. The top row compares raw country-level nationality labels (Panel A) against the release categories (Panel B): most records are released under the 3-category system (Belgian / EU / Non-EU, shown in blue); records whose 3-category cell fell below k=5 are recoded to the 2-category fallback (Belgian / Non-Belgian, shown in orange). The bottom row restricts to records that had a valid age before suppression and shows which age-gender cells survive intact (Panel C) and which fall below the k=5 threshold and are withheld from the release (Panel D, grey bars). Figure 2.3 then confirms that neighbourhood-level cross-tabulation remains meaningful after the same protections are applied.

Figure 2.2: Nationality grouping and age suppression — before vs after pipeline (offender records)

We preserve neighbourhood names as analytical attributes while all person and case identifiers are pseudonymized. The heatmap below shows whether spatial cross-tabulation still works after these protections.

Figure 2.3: Offence type by neighbourhood (post-suppression)

Taken together, Figures 2.2 and 2.3 show that the release remains suitable for standard descriptive and exploratory analysis across age, offence type, and neighbourhood, even though direct identifiers have been removed and some attributes have been generalised.

2.7. Residual Risks: Checks That Pseudonymization Alone Cannot Address

Sections 2.3–2.6 demonstrate that deterministic keyed pseudonymization solves the linkage problem and eliminates direct identifiers. However, two categories of risk remain after pseudonymization is complete and require separate treatment before a research extract can be safely released. They are not deficiencies of the pipeline: they are inherent to any structured data release and must be addressed at the release-preparation stage.

2.7.1 Residual Risk 1: Temporal Precision and Date Generalisation

Exact timestamps can become identifying when combined with offence category, neighbourhood, or person-level attributes — even when all direct identifiers have been removed. This risk is structural: it arises from the precision of the data itself, not from any failure of the pseudonymization step.

Table 2.13: Temporal precision vs. re-identification risk
Precision level	Unique combinations	Total records	% unique	Re-id risk
Exact (date + HH:MM)	119	120	99.2	High
Date + hour of day (24)	118	120	98.3	High
Date + time band (4)	117	120	97.5	High
Week + time band	108	120	90.0	High
Month + time band	87	120	72.5	High

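Table 2.13 should be read as a trade-off table rather than a fixed rule. It shows how uniqueness falls as temporal precision is coarsened: across these 120 records the decline is gradual, from 99.2% at exact timestamps to 72.5% at month plus time band, and every level is still rated high risk. A data controller can use a table of this kind to justify releasing broader time bands instead of exact timestamps, and to judge how much coarsening a given extract needs.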
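A comparison of this kind can be computed as follows. The sketch is hedged in two ways: it counts distinct combinations, which appears to be the metric Table 2.13 reports, and it assumes four six-hour time bands whose boundaries and names are my own choice. The `other_qis` argument stands for whatever additional attributes (offence category, neighbourhood) are combined with the timestamp.

```python
from collections import Counter
from datetime import datetime


def time_band(hour):
    """Map an hour (0-23) to one of four six-hour bands (assumed boundaries)."""
    return ("night", "morning", "afternoon", "evening")[hour // 6]


# Each coarsener turns a datetime into a hashable key at one precision level.
COARSENERS = {
    "exact": lambda t: t.strftime("%Y-%m-%d %H:%M"),
    "date_hour": lambda t: t.strftime("%Y-%m-%d %H"),
    "date_band": lambda t: (t.date().isoformat(), time_band(t.hour)),
    "week_band": lambda t: (tuple(t.isocalendar())[:2], time_band(t.hour)),
    "month_band": lambda t: (t.year, t.month, time_band(t.hour)),
}


def distinct_share(timestamps, other_qis, level):
    """Return (number of distinct combinations, percentage of total records)."""
    coarsen = COARSENERS[level]
    keys = [(coarsen(t), qi) for t, qi in zip(timestamps, other_qis)]
    n_distinct = len(set(keys))
    return n_distinct, 100.0 * n_distinct / len(keys)


def singleton_count(timestamps, other_qis, level):
    """Number of records whose combination occurs exactly once."""
    coarsen = COARSENERS[level]
    keys = [(coarsen(t), qi) for t, qi in zip(timestamps, other_qis)]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1)
```

The second function is included because the number of singleton records is often the more direct disclosure measure: a record that shares its combination with no other record is the one an intruder can single out.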

2.7.2 Residual Risk 2: Narrative Field Handling and Free-Text

Structured pseudonymization is not enough when police records include narrative text. Names, RRNs, addresses, and times often remain embedded in natural language fields and therefore require a separate redaction step before a research extract can be released. Unlike temporal precision, which can be addressed by coarsening a date field, free-text risk requires entity recognition: the system must detect that a phrase is a name or an address before it can redact it. Table 2.14 illustrates this transformation on four synthetic Dutch-language police narratives.

Table 2.14: Free-text narrative anonymization (raw vs redacted)
Record ID	Raw narrative	Redacted narrative
PV-A1B2	Op 15/03/2022 om 23:45 werd Jan De Smedt, RRN 86.04.12-234.71, wonende Langestraat 42, 9000 Gent, aangetroffen.	Op [DATUM] om [TIJD] werd [NAAM], RRN [RRN], wonende [ADRES], [POSTCODE], aangetroffen.
PV-C3D4	Het voertuig werd bestuurd door Mohamed El Amrani, geboortedatum 04/04/1990, RRN 90.04.04-123.45, uit Merelbeke.	Het voertuig werd bestuurd door [NAAM], geboortedatum [DATUM], RRN [RRN], uit Merelbeke.
PV-E5F6	Slachtoffer Emma Vandenberghe (geb. 12-06-1978), verblijvend te Veldstraat 18, Gent, deed aangifte van diefstal.	[NAAM] (geb. [DATUM]), verblijvend te [ADRES], Gent, deed aangifte van diefstal.
PV-G7H8	Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om 02:30.	Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om [TIJD].

The regex-based redaction illustrated in Table 2.14 is only a baseline that shows the transformation logic. In practice, entity detection in free text needs a stronger NLP layer, because names, addresses, and times appear in many formats that simple patterns do not capture reliably.
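A pattern-based baseline of this kind might look like the sketch below. The patterns are my own illustrative choices, not the ones used to produce Table 2.14, and they cover only the pattern-friendly entities (RRN, dates, times, simple street addresses, postcode plus municipality). Names are deliberately not handled, which is exactly the gap that an entity-recognition layer has to fill.

```python
import re

# Order matters: the RRN pattern runs first so later patterns never see its digits.
PATTERNS = [
    ("[RRN]", re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b")),
    ("[DATUM]", re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b")),
    ("[TIJD]", re.compile(r"\b\d{2}:\d{2}\b")),
    # Street name ending in a common Dutch suffix, followed by a house number.
    ("[ADRES]", re.compile(r"\b[A-Z][a-z]+(?:straat|laan|weg|plein) \d+\b")),
    # Four-digit postcode followed by a capitalised municipality name.
    ("[POSTCODE]", re.compile(r"\b\d{4} [A-Z][a-z]+\b")),
]


def redact(text):
    """Replace every match of each pattern with its placeholder tag."""
    for tag, pattern in PATTERNS:
        text = pattern.sub(tag, text)
    return text
```

Applied to the first narrative in Table 2.14, this sketch redacts the date, time, RRN, address, and postcode but leaves "Jan De Smedt" untouched. Street names containing spaces or diacritics would also slip through.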

2.8. Part II Summary

Deterministic HMAC-SHA256 pseudonymization preserves all intended joins across offender, victim, and incident files; the linkage scorecard shows 100% preservation in this synthetic demo.
Canonicalization of PV numbers and RRNs prevents formatting differences from breaking cross-dataset linkage.
Quasi-identifier risk remains after pseudonymization; grouped nationality plus age suppression (k >= 5) reduces small-cell risk while keeping analytical value.
Researcher-facing outputs are limited to four pseudonymized files (offender, victim, incident, and linked records); the key mapping table is kept separately in a secure folder accessible only to the data controller and is never released.
For this prototype a synthetic demo key can be used when ALLOW_DEMO_KEY=true is set; in production the key must be supplied by the data controller from a secure vault or HSM.
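The first, second, and last points of this summary can be illustrated together. The sketch below is a minimal version under stated assumptions: the environment variable `PSEUDONYM_KEY`, the demo key value, the PV canonicalization rule, the pseudonym prefixes, and the 16-character truncation are all hypothetical choices for the example. Only HMAC-SHA256 and the `ALLOW_DEMO_KEY` switch come from the report itself.

```python
import hashlib
import hmac
import os
import re


def load_key():
    """Return the pseudonymization key as bytes."""
    key = os.environ.get("PSEUDONYM_KEY")  # hypothetical variable name
    if key:
        return key.encode("utf-8")
    if os.environ.get("ALLOW_DEMO_KEY", "").lower() == "true":
        return b"synthetic-demo-key"  # prototype only, never for real data
    raise RuntimeError("No key supplied and ALLOW_DEMO_KEY is not true")


def canonical_rrn(raw):
    """Strip separators so '86.04.12-234.71' and '86041223471' match."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 11:
        raise ValueError("an RRN has 11 digits")
    return digits


def canonical_pv(raw):
    """Assumed rule: drop spaces, dots, slashes and hyphens, then upper-case."""
    return re.sub(r"[\s./-]", "", raw).upper()


def pseudonymize(canonical_value, key, prefix):
    """Deterministic keyed pseudonym: same key and input give the same output."""
    digest = hmac.new(key, canonical_value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:16]}"  # truncation length is an arbitrary choice
```

Because the function is deterministic for a given key, the same RRN yields the same pseudonym in the offender, victim, and incident files, so a join on the pseudonym column reproduces the join on the original identifier. Canonicalizing before hashing is what keeps differently formatted copies of the same identifier from producing different pseudonyms.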

3. Regulatory Compliance, Methodological Constraints, and Researcher Qualifications

3.1. Legal and Regulatory Framework: GDPR, LED 2016/680, and Wet op het Politieambt

While the General Data Protection Regulation (GDPR) establishes the overarching privacy-by-design principles reflected in this pipeline, the primary legislative authority for police data processing in Belgium is Directive (EU) 2016/680 — the Law Enforcement Directive (LED) — transposed into Belgian law by the Law of 30 July 2018. Data held for law enforcement purposes does not fall under the GDPR directly; the LED and the Wet op het Politieambt (WPA) together govern the lawfulness, retention, and research-access conditions that apply to Belgian police datasets. When the same data are disclosed to a non–law-enforcement controller (e.g., a university) for independent research, that downstream processing is normally subject to the GDPR.

Table 3.1 is a compliance map. It links each technical mechanism in the pipeline to the identifier or risk it addresses and to the corresponding legal basis under the GDPR, the LED, and the Wet op het Politieambt.

Table 3.1: Legal compliance mapping
Mechanism	Identifier addressed	Legal basis (GDPR / LED 2016/680 / WPA)
HMAC-SHA256 pseudonymization	PV number, Person ID, RRN	GDPR Art. 4(5); LED Art. 3(5) — pseudonymization; key held by controller
Age group (not exact DOB)	Date of birth	GDPR Art. 5(1)(c); LED Art. 4(c); WPA Art. 44/1 — data minimisation
Grouped nationality categories	Nationality	GDPR Art. 9(1); LED Art. 10 — special categories in criminal justice context
Address removed	Street, house number	GDPR Art. 5(1)(c); LED Art. 4(c); WPA Art. 44/1 — minimum necessary data
k-Anonymity suppression	QI combinations	GDPR Art. 89(1); LED Art. 4(3) — proportionate technical safeguards for research (scientific use subject to appropriate safeguards)
Universal person pseudonym	Cross-dataset identity	GDPR Rec. 26; LED Art. 3(5) — re-identification must not be reasonably possible for recipients; remains personal data for the controller
Key held separately	De-pseudonymization	GDPR Art. 25; LED Art. 20; LED Art. 29; WPA Art. 44/1; WPA Art. 44/11, §§ 7–14 — data protection by design and security of processing

Key operational requirement for the UGent Crime Lab project. The consistent pseudonymization key is the technical core of the anonymization algorithm. The data controller (the local police zone) must generate it once per data release and store it in a hardware security module (HSM) or certified key vault; it must never be held in the research environment. The key must be rotated per researcher batch to prevent cross-batch re-identification, and its use must be audited in accordance with the GDPR Art. 30 processing register obligation. This key management approach reflects the data protection by design and security duties in GDPR Art. 25, LED Arts. 20 and 29, and WPA Art. 44/1.
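One way to realise rotation per researcher batch is to derive a batch-specific key from a master key held by the controller. This is an assumption of the sketch, not a procedure described in this report: generating an independent random key for each batch would serve the same purpose, and in production either step would run inside the HSM or key vault rather than in application code. The function names and the `batch:` label are hypothetical.

```python
import hashlib
import hmac


def derive_batch_key(master_key, batch_id):
    """Derive a batch-specific key from the controller's master key (HMAC-based)."""
    label = b"batch:" + batch_id.encode("utf-8")
    return hmac.new(master_key, label, hashlib.sha256).digest()


def pseudonym(identifier, batch_key):
    """Keyed pseudonym for one identifier within one batch."""
    return hmac.new(batch_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
```

Within a batch the same identifier always maps to the same pseudonym, so joins across the batch's files still work. Across batches the pseudonyms differ, so two researchers cannot combine their extracts to link the same person.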

Wet op het Politieambt (WPA) — Belgian Police Act. Belgian police data is additionally governed by the Wet op het Politieambt (WPA, Belgisch Staatsblad 22 December 1992, as amended). Art. 44/1 WPA requires that personal data collected in the performance of police duties be accurate, adequate, relevant and not excessive. Art. 44/3 WPA mandates retention periods proportionate to purpose. Art. 44/7 WPA grants data subjects rights of access and correction. The pseudonymization pipeline directly operationalises the WPA data minimisation requirement: only the minimum attributes needed for scientific analysis are transferred to the researcher; all surplus personal data is removed or transformed before the data leaves the police system.

LED 2016/680 — EU Law Enforcement Directive. Police data held for law enforcement purposes falls under Directive (EU) 2016/680 (transposed in Belgium by the Law of 30 July 2018), not the GDPR directly. Key distinctions from the GDPR are as follows: lawfulness of processing derives from national law (LED Art. 8, not GDPR Art. 6); research access requires a documented scientific purpose with minimum-necessary data (LED Art. 4(3)); and special categories of data, including ethnic origin and health data, may be processed only where strictly necessary under LED Art. 10. The compliance table above maps each mechanism to its GDPR provision and to the corresponding LED and WPA articles.

3.2. Methodological Constraints in Police-Data Anonymization

The report does not demonstrate every operational challenge of a full police-data release system, but it does show three constraints that remain central in practice. First, spatial masking does not affect every criminological analysis in the same way: in Part I the log-distance and boundary-free 1 km familiarity terms are more sensitive to point displacement than several ward-level opportunity covariates, so masking has to be validated against the target analysis rather than chosen abstractly. Second, linked files remain usable only when pseudonymization is deterministic, because joins fail if key use or identifier formatting varies across extracts. Third, even after direct identifiers are removed, linked researcher files can still contain rare quasi-identifier combinations, which is why suppression and coarsening remain necessary before release.

3.3. Researcher Competencies and the Path Forward

This report demonstrates the technical foundations of a privacy-preserving pipeline for Belgian police data. The three areas below describe not only the competencies brought to bear in this demonstration, but specifically how I plan to carry the project forward into an operational data release system for the UGent Crime Lab.

Spatial criminology, scale, and model sensitivity. The crime location choice framework applied in Part I will guide masking decisions in the operational pipeline: because spatial masking affects different covariates differently, each new release will need to be validated against the target analysis rather than evaluated in the abstract. I plan to extend this validation framework to additional offence types and to Belgian spatial units, building a masking-quality evidence base that is directly reusable across research projects within the Crime Lab.

Linked-data handling and pseudonymization logic. The pseudonymization pipeline built in Part II is designed for extension. The next steps are to integrate the canonical PV and RRN wrappers into the data controller's extract workflow, define key rotation procedures per researcher batch, and test the pipeline against realistic multi-zone Belgian police extracts. The goal is a documented, repeatable release procedure that any police zone data manager can follow without specialist programming knowledge.

Reproducible implementation and risk auditing. The risk audit framework (quasi-identifier analysis, k-anonymity checks, suppression logging) will be packaged as a standalone module, implemented in R or Python depending on the operational environment, that the data controller can run before each release. I plan to document the module so that it satisfies the GDPR Art. 30 processing register obligation and can be presented to a Data Protection Officer as structured evidence of the technical safeguards in place.

Security note on key mapping table: A mapping between real identifiers (RRN, PV numbers) and their pseudonyms has been written to a secure folder accessible only to the data controller. This file is for police system use only. It must never be shared with researchers or stored in the research environment. In production, this table would be held in the data controller's HSM or certified key vault, satisfying GDPR Art. 32.

All names, RRN numbers, and case details in this section are fully synthetic. No real persons are represented.

References

Belgian Federal Government. (1992). Wet op het Politieambt [Belgian Police Act]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/

Belgian Federal Government. (2018). Wet van 30 juli 2018 betreffende de bescherming van natuurlijke personen met betrekking tot de verwerking van persoonsgegevens [Law of 30 July 2018 on the protection of natural persons with regard to the processing of personal data]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/

Bernasco, W., & Nieuwbeerta, P. (2005). How do residential burglars select target areas? A new approach to the analysis of criminal location choice. British Journal of Criminology, 45(3), 296-315. https://doi.org/10.1093/bjc/azh070

European Union Agency for Cybersecurity. (2019). Pseudonymisation techniques and best practices: Recommendations on shaping technology according to data protection and privacy provisions. Publications Office of the European Union. https://www.enisa.europa.eu/publications/pseudonymisation-techniques-and-best-practices

European Parliament and Council of the European Union. (2016a). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88. https://eur-lex.europa.eu/eli/reg/2016/679/oj

European Parliament and Council of the European Union. (2016b). Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data. Official Journal of the European Union, L 119, 89-131. https://eur-lex.europa.eu/eli/dir/2016/680/oj

Fotheringham, A. S., & Wong, D. W. S. (1991). The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A, 23(7), 1025-1044. https://doi.org/10.1068/a231025

Kuralarasan, K., & Bernasco, W. (2022). Location choice of snatching offenders in Chennai City. Journal of Quantitative Criminology, 38, 673-696. https://doi.org/10.1007/s10940-021-09514-9

Openshaw, S. (1984). The modifiable areal unit problem (Concepts and Techniques in Modern Geography No. 38). Geo Books.

Song, G., Bernasco, W., Liu, L., Xiao, L., Zhou, S., & Liao, W. (2019). Crime feeds on legal activities: Daily mobility flows help to explain thieves' target location choices. Journal of Quantitative Criminology, 35(4), 831-854. https://doi.org/10.1007/s10940-019-09406-z

Steenbeek, W., & Weisburd, D. (2016). Where the action is in crime? An examination of variability of crime across different spatial units in The Hague, 2001-2009. Journal of Quantitative Criminology, 32(3), 449-469. https://doi.org/10.1007/s10940-015-9276-3

Sweeney, L. (2000). Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. http://dataprivacylab.org/projects/identifiability/paper1.pdf

Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570. https://doi.org/10.1142/S0218488502001648

Weisburd, D., Groff, E. R., & Yang, S.-M. (2012). The criminology of place: Street segments and our understanding of the crime problem. Oxford University Press.

© 2026 Dr. Kuralarasan Kumar. This document and its methodology are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). All synthetic data used in Part II are fully artificial; no real persons are represented.

Anonymizing Crime Data for Research

A prototype pipeline for spatial masking, pseudonymization, and analytical validity in Belgian police records

Dr. Kuralarasan Kumar

April 2026