Scope of the report

Police-derived crime data always carries two disclosure problems at once: sensitive locations that reveal where crimes happened and where offenders live, and identifiers — PV numbers, RRN, person attributes — that allow the same individual to be traced across files and datasets. In practice, any anonymization pipeline must handle both simultaneously: masking locations without destroying the analytical model, and pseudonymizing identifiers without breaking the cross-file joins that make linked-data research possible.

This document demonstrates that both problems can be addressed within a reproducible workflow:

  • Part I shows what happens to analytical results when location data is masked — i.e., does the model still give the same answers after spatial transformation?
  • Part II shows how person and case identifiers can be transformed so that cross-file joins still work for researchers — i.e., can you still link offender to victim to incident after pseudonymization?
  • Part III situates the demonstrated techniques within the applicable regulatory framework (GDPR, LED 2016/680, Wet op het Politieambt), identifies the principal methodological constraints that any operational police-data release must address, and maps the researcher's competencies onto those constraints.

Together they form the analytical and technical foundation of a pipeline that is both privacy-protective and analytically valid.

1. Chennai Spatial Masking Analysis

Part I applies six spatial masking methods (three grid aggregation levels and three geomasking radius ranges) to a real street robbery dataset from Chennai, India. Each method shifts or aggregates crime incident points and offender home locations to reduce location disclosure risk. The masked datasets are then used to re-estimate the crime location choice model, and the results are compared against the unmasked baseline. The central question is how much the model coefficients change: a small deviation means the masked data can still support the same analytical conclusions, while a large deviation means the transformation has distorted the results beyond practical use.
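The two families of masking methods can be illustrated with a minimal sketch. It assumes coordinates are already projected to metres, that grid aggregation snaps each point to the centre of its cell, and that geomasking draws a random displacement between a minimum and maximum radius; the report does not state these implementation details, so the function names and sampling choices below are illustrative assumptions.

```python
import math
import random

def grid_mask(x, y, cell_size):
    """Snap a projected point (metres) to the centre of its grid cell."""
    col = math.floor(x / cell_size)
    row = math.floor(y / cell_size)
    return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)

def donut_geomask(x, y, r_min, r_max, rng):
    """Displace a point by a random distance in [r_min, r_max] at a random bearing."""
    # Sampling the squared radius uniformly spreads points evenly over the ring's area.
    r = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return (x + r * math.cos(theta), y + r * math.sin(theta))

rng = random.Random(42)  # fixed seed so the masked output is reproducible
original = (1234.0, 5678.0)
snapped = grid_mask(*original, cell_size=250)
moved = donut_geomask(*original, r_min=50, r_max=300, rng=rng)
shift = math.dist(original, moved)
```

Grid aggregation is deterministic, whereas geomasking depends on the random draw, so fixing the seed matters if the masked file has to be regenerated identically.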

1.1. Chennai baseline and comparison logic

The baseline follows the Model 2 specification from the published Chennai snatching study (Kuralarasan & Bernasco, 2022, Table 3), with one adaptation: the same-ward prior-crime indicator is replaced by a boundary-free 1 km familiarity indicator. This change makes the model less sensitive to arbitrary administrative boundaries once crime and home points are spatially masked. Table 1.1 below ranks the tested methods by overall deviation from the unmasked result.

This comparison is also relevant from the perspective of spatial units of analysis. A substantial literature on the modifiable areal unit problem shows that model coefficients can change when phenomena are aggregated into different spatial units, especially when administrative boundaries do not align well with the behavioral processes under study (Fotheringham & Wong, 1991; Openshaw, 1984). In spatial analysis specifically, variation across spatial units is often large enough to alter interpretation (Steenbeek & Weisburd, 2016; Weisburd et al., 2012). For that reason, spatial masking should be evaluated not only as a privacy intervention but also as a change in the spatial representation through which offender decision-making is measured.

The prior-crime familiarity indicator is especially sensitive to administrative boundary dependence. A same-ward indicator is administratively convenient, but it registers a boundary crossing as a meaningful change in offender familiarity even when the masked point remains geographically close to its original position. A boundary-free 1 km indicator is therefore more robust to small positional shifts and more consistent with arguments in spatial criminology that offenders' awareness spaces and relevant opportunity structures do not necessarily coincide with administrative borders (Bernasco & Nieuwbeerta, 2005; Song et al., 2019).
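A minimal sketch of the boundary-free indicator follows. It assumes projected coordinates in metres and plain Euclidean distance; the function name and the toy points are hypothetical and are not taken from the Chennai data.

```python
import math

def any_prior_crime_within(candidate, prior_crimes, radius_m=1000.0):
    """Return 1 if any prior crime lies within radius_m of the candidate location, else 0."""
    return int(any(math.dist(candidate, p) <= radius_m for p in prior_crimes))

prior = [(0.0, 0.0), (5000.0, 5000.0)]
near_candidate = (600.0, 700.0)     # about 922 m from the first prior crime
far_candidate = (2000.0, 2000.0)    # more than 1 km from both prior crimes

near_flag = any_prior_crime_within(near_candidate, prior)
far_flag = any_prior_crime_within(far_candidate, prior)
```

Because the indicator depends only on distance, a small positional shift changes its value only when a point crosses the 1 km radius, not whenever it crosses a ward boundary.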

Table 1.1: Masking methods vs baseline

| Method | RMSE | Max abs. bias | Most affected covariate | Crime point in same ward (%) | Home point in same ward (%) |
| --- | --- | --- | --- | --- | --- |
| Grid 250m | 0.017 | 0.038 | Marriage halls (per 10) | 86.5 | 87.2 |
| Geomask 50-300m | 0.028 | 0.075 | Mosques (per 10) | 79.0 | 79.3 |
| Grid 500m | 0.040 | 0.124 | Any prior crime within 1 km (0,1) | 78.7 | 79.1 |
| Geomask 200-600m | 0.084 | 0.336 | Any prior crime within 1 km (0,1) | 59.0 | 59.2 |
| Grid 1000m | 0.091 | 0.224 | Mosques (per 10) | 58.1 | 62.6 |
| Geomask 400-1200m | 0.114 | 0.496 | Any prior crime within 1 km (0,1) | 34.5 | 32.0 |

Table 1.1 is a joint summary of overall model drift and spatial reassignment stability. RMSE reports the root mean square error of coefficient deviation from the unmasked model, while Max abs. bias shows the single largest coefficient shift within each masking method. The Most affected covariate column identifies which coefficient is most distorted, and the two ward-stability columns show how often masked crime points and home points remain in their original wards after masking — together, these columns reveal not only which method ranks best overall, but also why some methods degrade more quickly than others.

The best-performing masking method in this analysis is Grid 250m. It has the lowest RMSE (0.017), the smallest maximum single-term bias (0.038), and the highest ward stability for both crime points and home points (86.5% and 87.2%). The second-ranked method, Geomask 50-300m, remains relatively close to the baseline but already shows lower ward stability (about 79%) and a larger maximum bias (0.075). The next method, Grid 500m, still preserves the broad substantive pattern, but it is the first where the boundary-free familiarity term becomes the worst-affected coefficient, indicating that the model starts to feel the effects of more frequent spatial reassignment. The wider masking settings show the same pattern more clearly: once ward stability drops to roughly 60% or lower, coefficient distortion increases substantially even though the general direction of the main findings remains recognizable.
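The two summary statistics can be sketched as follows. The sketch assumes that deviations are measured on the log-odds (coefficient) scale, which is consistent with the published figures up to rounding but is not stated explicitly in the report. It uses only four terms copied from Table 1.2, so the RMSE it produces is illustrative and will not equal the full-model value of 0.017.

```python
import math

# Odds ratios copied from Table 1.2 as (baseline, Grid 250m); a four-term subset.
odds_ratios = {
    "Distance (km)": (0.356, 0.350),
    "Any prior crime within 1 km (0,1)": (10.018, 9.741),
    "Temples (per 10)": (1.009, 0.980),
    "Marriage halls (per 10)": (1.123, 1.168),
}

# Deviation of each coefficient on the log-odds scale: ln(masked OR) - ln(baseline OR).
deviation = {
    term: math.log(masked) - math.log(base)
    for term, (base, masked) in odds_ratios.items()
}

# Root mean square error across terms, and the single largest absolute shift.
rmse = math.sqrt(sum(d ** 2 for d in deviation.values()) / len(deviation))
most_affected = max(deviation, key=lambda term: abs(deviation[term]))
max_abs_bias = abs(deviation[most_affected])
```

With these rounded odds ratios the largest shift belongs to the marriage-halls term at roughly 0.039, close to the 0.038 reported for Grid 250m in Table 1.1.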

1.2. Baseline vs best masking method

Table 1.2 compares the unmasked baseline directly against the best-performing masking method. It reports odds ratios, p-values, and the percentage change in the odds ratio. This is the most direct way to judge whether masking changes the substantive interpretation of the Chennai model under the boundary-free familiarity specification.

Table 1.2: Baseline vs best masking method (Grid 250m)

| Term | Baseline OR | Baseline p | Masked OR | Masked p | OR diff. | % OR change |
| --- | --- | --- | --- | --- | --- | --- |
| Distance (km) | 0.356 | 0.000 | 0.350 | 0.000 | -0.007 | -1.832 |
| Any prior crime within 1 km (0,1) | 10.018 | 0.000 | 9.741 | 0.000 | -0.277 | -2.767 |
| Area (km²) | 1.075 | 0.000 | 1.092 | 0.000 | 0.017 | 1.576 |
| Population (per 1,000) | 1.004 | 0.208 | 1.003 | 0.403 | -0.001 | -0.132 |
| Retail stores (per 10) | 1.010 | 0.358 | 1.013 | 0.261 | 0.002 | 0.228 |
| Transit stations (per 10) | 1.039 | 0.432 | 1.031 | 0.524 | -0.008 | -0.727 |
| Mosques (per 10) | 1.073 | 0.450 | 1.093 | 0.341 | 0.020 | 1.859 |
| Temples (per 10) | 1.009 | 0.760 | 0.980 | 0.494 | -0.029 | -2.916 |
| Churches (per 10) | 1.182 | 0.000 | 1.161 | 0.000 | -0.021 | -1.757 |
| Education institutions (per 10) | 1.072 | 0.000 | 1.066 | 0.000 | -0.005 | -0.507 |
| School and college (per 10) | 0.989 | 0.669 | 0.994 | 0.814 | 0.005 | 0.495 |
| Personal care (per 10) | 1.066 | 0.003 | 1.052 | 0.021 | -0.015 | -1.379 |
| Hospitals (per 10) | 0.999 | 0.950 | 1.000 | 0.994 | 0.001 | 0.113 |
| Marriage halls (per 10) | 1.123 | 0.004 | 1.168 | 0.000 | 0.044 | 3.921 |
| Jewelleries (per 10) | 1.025 | 0.163 | 1.035 | 0.045 | 0.010 | 1.003 |
| Textiles (per 10) | 0.987 | 0.128 | 0.984 | 0.058 | -0.003 | -0.331 |
| Park (per 10) | 1.176 | 0.017 | 1.164 | 0.024 | -0.012 | -1.018 |
| Recreation facilities (per 10) | 0.957 | 0.124 | 0.984 | 0.556 | 0.026 | 2.743 |
| Restaurant (per 10) | 1.028 | 0.033 | 1.035 | 0.009 | 0.006 | 0.612 |
| Government office (per 10) | 1.058 | 0.013 | 1.062 | 0.007 | 0.004 | 0.375 |

The main findings remain stable under the best masking method. The distance effect remains negative and strong, and the boundary-free 1 km familiarity effect remains strongly positive. No odds ratio changes by more than 4%, and the only term that crosses the conventional 0.05 significance threshold is Jewelleries (p = 0.163 at baseline, 0.045 after masking), a small shift in an already weak effect. In the HTML version, the masked p-value column is colour-coded by significance stability relative to the baseline, while the absolute difference and percentage change in the odds ratio are shaded by the size of the deviation. The core criminological interpretation therefore does not reverse under the best-performing masking specification.

1.3. Coefficient Patterns and Overall Deviation

The two figures below complement Tables 1.1 and 1.2 by showing the same masking results at two different levels of summary. Figure 1.1 is coefficient-specific: it shows how each odds ratio moves across masking methods relative to the unmasked baseline, marked by the dashed line in each panel. This makes it possible to see which terms remain tightly clustered across methods and which terms drift more clearly as masking becomes stronger. Figure 1.2 then collapses that information into a single method-level ranking using RMSE, so it should be interpreted as a compact summary of overall deviation rather than as a substitute for the term-by-term comparison in Table 1.2.

Figure 1.1: Odds ratios by masking method

Figure 1.2: RMSE ranking of masking methods

Taken together, the figures reinforce the pattern already visible in the tables. The smaller masking settings remain much closer to the unmasked baseline, while the wider geomasking and coarser grid methods introduce visibly larger coefficient movement. The coefficient plot also shows that distortion is not evenly distributed across terms: some ward-level opportunity covariates remain comparatively stable, whereas the familiarity term and a smaller number of place-based covariates become more sensitive as spatial reassignment becomes more common. The RMSE ranking in Figure 1.2 is therefore best interpreted as a summary of a broader pattern already visible in Figure 1.1, not as an isolated performance score.

1.4. Robustness Check Without a Prior-Location Covariate

This section reports a reduced model excluding the prior-location covariate. The main Chennai specification retains a boundary-free prior-familiarity covariate based on whether the offender had previously offended within 1 km of the candidate location, because that measure is less sensitive to arbitrary boundary crossings than a same-ward indicator. A reasonable concern, however, is that the overall masking results might partly depend on that modeling choice. To address that concern, I also estimated a reduced specification that drops the prior-location covariate entirely and re-runs the same masking comparison on the reduced model.

This reduced model is not treated as the preferred substantive specification. It removes an important mechanism of spatial familiarity and therefore answers a more limited question: if that mechanism is omitted altogether, do the masking results still show the same broad ranking pattern? Read in that way, the reduced model is a robustness check on the masking comparison, not a replacement for the main model. Table 1.3 presents the results.

Table 1.3: Robustness (reduced model)

| Robustness specification | Best masking method | Best RMSE | Crime point stays in same ward (%) | Home point stays in same ward (%) |
| --- | --- | --- | --- | --- |
| Reduced model (drop prior-crime term) | Geomask 50-300m | 0.015 | 79.8 | 80 |

In that reduced model, the best masking method was Geomask 50-300m with an RMSE of about 0.015, slightly lower than the best RMSE in the main 1 km familiarity model (0.017, Grid 250m). That lower RMSE should not be over-interpreted as evidence that the reduced model is substantively better. With one fewer behaviorally important covariate to preserve, the reduced specification is simply easier to reproduce after masking. The important point is that the smaller masking settings still perform best, while the wider masking settings still introduce visibly more distortion. In other words, the broad masking pattern does not disappear when the prior-location covariate is omitted, but the main specification remains preferable because it preserves a more meaningful behavioural mechanism.

1.5. Part I Summary

Part I is a baseline-deviation study: it measures how far the masked model moves away from the original fitted result, rather than attempting to recover true population parameters. The main result is that smaller masking settings preserve the original model coefficients most effectively: Grid 250m performs best, Geomask 50-300m remains close, and wider masking settings introduce progressively larger distortion. The reduced-model robustness check supports the same broad conclusion: omitting the prior-location covariate changes which small-displacement method ranks first (Geomask 50-300m instead of Grid 250m), but it does not overturn the overall ranking pattern in which smaller masking settings are analytically safer. The practical implication is therefore straightforward: spatial masking can remain compatible with the original analytical conclusion, but only within a relatively limited range of spatial displacement or aggregation.

2. Pseudonymization, PV-Number Consistency & Cross-Dataset Linking

Belgian police data spans roughly 196 local police zones and multiple institutional systems — local operational registers (ISLP/ISLP2), national reference systems (ANG/BNG, FEEDIS), justice environments (JustCase, JustMask), prison systems (Sidis Suite), and forensic research infrastructures (DOT, be.care). The same person, case, or event can therefore appear across these separate systems under different local formatting conventions. Without an algorithmic approach, zones anonymize data inconsistently: one zone may redact birth year while another retains it; one may suppress nationality while another codes it differently. Aggregated research then carries systematic measurement error invisible to the researcher, and cross-dataset joins on PV numbers or person identifiers break silently.

Removing names is not sufficient to address this. PV numbers, the national registry number (Rijksregisternummer/RRN), and date of birth together can still support re-identification, and replacing them with a fresh random ID each time destroys the cross-file joins that make linked-data research possible. This section demonstrates that consistent keyed pseudonymization solves both problems simultaneously: applying HMAC(key, RRN) deterministically means the same person always receives the same pseudonym regardless of which zone or system extracted the record, so direct identifiers are removed and cross-dataset linkage is preserved. Pseudonyms can be mapped back to individuals only by a party holding the secret key, which stays with the data controller.

2.1. Three Synthetic Belgian Police-Style Datasets

The three synthetic datasets used below — crime incidents, offender records, and victim records — were generated to resemble Belgian police/crime data. All fields, identifiers, and record structures follow Belgian police data conventions, but no real personal data are used.

2.1.1 Synthetic Incident Register (Dataset 1)

Table 2.1 shows the structure of the synthetic incident register before pseudonymization. It represents recorded criminal incidents registered through PV-based workflows in a Belgian police data structure. The important analytical fields are the PV number, date, offence type, neighbourhood, and injury/outcome variables.

Table 2.1: Synthetic incident preview (first 4)

| PV Number | Date | Offence Type | Neighbourhood | Injury |
| --- | --- | --- | --- | --- |
| 2022/GNT/05041 | 2023-08-04 | Property crime | Wondelgem | Geen |
| 2024/GNT/03266 | 2022-02-18 | Public order / substance | Sint-Amandsberg | Geen |
| 2022/GNT/05326 | 2022-03-12 | Property crime | Mariakerke | Geen |
| 2022/GNT/09504 | 2024-07-07 | Public order / substance | Mariakerke | Geen |

Taken together with the person-level files below, this incident register provides the case-level anchor for later linkage checks.

2.1.2 Offender Records — Dader Register (Dataset 2)

Some offenders committed multiple crimes (repeat offenders: the same person_id appears in multiple rows).

Table 2.2 is a person-event file: the same person can recur across multiple PV records, making linkage preservation a central requirement of the anonymization pipeline.

Table 2.2: Synthetic offender preview (first 4)

| Person ID | PV Number | Gender | Nationality | Role |
| --- | --- | --- | --- | --- |
| D0057 | 2021/GNT/03774 | V | Duits | Verdachte |
| D0044 | 2022/GNT/03783 | M | Congolees | Verdachte |
| D0028 | 2024/GNT/03266 | M | Roemeens | Verdachte |
| D0041 | 2024/GNT/09571 | V | Nederlands | Verdachte |

2.1.3 Victim Records — Slachtoffer Register (Dataset 3)

Some victims appear in multiple incidents. Critically, some persons are both offender and victim in different cases, a real pattern in interpersonal violence data.

Table 2.3 mirrors the offender file from the victim side, showing that the eventual release logic has to preserve not only offender-to-incident joins, but also cross-role person linkage across files.

Table 2.3: Synthetic victim preview (first 4)

| Person ID | PV Number | Gender | Nationality | Relation to Suspect |
| --- | --- | --- | --- | --- |
| S0016 | 2024/GNT/05463 | M | Duits | Kennis |
| S0037 | 2022/GNT/06644 | M | Pools | Onbekend |
| S0061 | 2023/GNT/08088 | M | Frans | Collega |
| S0066 | 2024/GNT/03176 | M | Duits | Familielid |


2.2. Linkage Failure Under Naive Anonymization

A naive approach strips direct identifiers (name, RRN, address) and replaces each identifier with a new random ID every time. This is the most common mistake in ad-hoc anonymization.

Why this fails: random replacement assigns one code to 2022/GNT/00341 in the offender file and a completely different code to the same PV number in the victim file. Researchers can no longer join on PV number, so the data becomes useless for cross-file analysis.

The root cause is non-determinism: because each call to sample() is independent, the same input value produces a different output in each file. The fix is not to add more redaction — it is to replace randomness with a deterministic function: one that always maps the same input to the same output, using a secret key that only the data controller holds. Table 2.2b quantifies this failure using actual join counts.

Table 2.2b: Naive anonymization linkage failure

| Join type | Raw data | After naive anonymization | Result |
| --- | --- | --- | --- |
| Offender → Victim (shared PV number) | 54 shared PV numbers | 0 matched rows | ✗ Complete linkage failure |
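The failure mode can be reproduced in a few lines. The sketch below uses invented PV numbers in the document's format and Python's `uuid` module as a stand-in for any independent random ID generator; it is not the code that produced Table 2.2b.

```python
import uuid

# Toy PV numbers; one case appears in both files.
offender_pvs = ["2021/GNT/00011", "2022/GNT/00022", "2024/GNT/00033"]
victim_pvs = ["2024/GNT/00033", "2022/GNT/00044"]

def naive_anonymize(values):
    """Replace every value with a fresh random ID, independently of any other file."""
    return [uuid.uuid4().hex for _ in values]

# Before anonymization the two files share one PV number.
raw_shared = set(offender_pvs) & set(victim_pvs)

# After independent random replacement the shared case no longer matches.
naive_shared = set(naive_anonymize(offender_pvs)) & set(naive_anonymize(victim_pvs))
```

The join key survives in the raw files but not in the naively anonymized ones, because nothing ties the random ID issued in one file to the ID issued for the same case in the other.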

2.3. Consistent Keyed Pseudonymization as a Solution

I propose to use HMAC-SHA256, a standard keyed hashing method. It turns personal identifiers such as an RRN or PV number into a stable pseudonym using a secret key applied consistently across all files (see §2.2 for why this determinism is the core requirement). The output is truncated to a 24-character hexadecimal token (96 bits), which keeps accidental collisions negligible at this scale and cannot be reversed or recomputed without the key.

A terminological point is important here. Under GDPR Art. 4(5) and Recital 26 (as reiterated in EDPB Guidelines 01/2025 on Pseudonymisation), data remains pseudonymized, not anonymized, as long as a key exists that could link the pseudonym back to the original identifier. This pipeline is therefore correctly described as a privacy-preserving transformation rather than full anonymization: it replaces direct identifiers with stable pseudonyms while keeping the key under the exclusive control of the data controller. Full anonymization would require, at a minimum, that the key be permanently destroyed, which would also permanently remove the ability to audit or correct the released data. The choice of pseudonymization here is deliberate.

ENISA discusses keyed-hash / HMAC-style approaches as valid pseudonymization techniques (ENISA, 2019), and this implementation is compatible with GDPR Art. 4(5).

The same HMAC key is applied to all three datasets. Each direct identifier — RRN, PV number, name, address — is either pseudonymized or removed. The researcher-facing files retain only analytical fields and stable pseudonyms.
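A minimal sketch of the transformation, using Python's standard `hmac` and `hashlib` modules, is shown below. The key, the function name `pseudonymize`, and the sample PV number are illustrative assumptions; the 24-character truncation and the `PV-` prefix mirror the format of the tokens shown in this report.

```python
import hashlib
import hmac

def pseudonymize(value, key, prefix, length=24):
    """Derive a stable pseudonym from an identifier with HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:length].upper()}"

# Hypothetical key for illustration only. A real key would be generated randomly
# and stored by the data controller, never in source code.
SECRET_KEY = b"example-key-held-by-the-data-controller"

# The same PV number pseudonymized while processing two different files.
pv_in_offender_file = pseudonymize("2024/GNT/00033", SECRET_KEY, "PV")
pv_in_victim_file = pseudonymize("2024/GNT/00033", SECRET_KEY, "PV")

# A different key yields an unrelated pseudonym for the same input.
pv_other_key = pseudonymize("2024/GNT/00033", b"a-different-key", "PV")
```

Because the mapping is exact, identifiers would need to be normalised to one canonical format before hashing: two zones that write the same PV number with different spacing or separators would otherwise produce different pseudonyms.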

2.3.1 Structure of the Pseudonymized Research Files

The researcher-facing files shown below retain analytical fields and stable pseudonyms, but remove direct identifiers such as names, RRN, and raw PV numbers.

Table 2.4: Pseudonymized offender preview (first 5)

| PV Pseudonym | Person Pseudonym | Gender | Nationality | Role |
| --- | --- | --- | --- | --- |
| PV-841A5F319D8B609FB41C7E8C | D-F23B9BA25B6EC33EF0F3BD2C | V | Duits | Verdachte |
| PV-85A517390E8F41928753A119 | D-523BA24A18253BDFC7A8C87D | M | Congolees | Verdachte |
| PV-7E79B32DE9FE421A4CA45E08 | D-1895C0DB1C22DB37728E97CE | M | Roemeens | Verdachte |
| PV-01FB2838F77F4CBDEAD4396E | D-80C0220A9292AD1AE3170D97 | V | Nederlands | Verdachte |
| PV-CE1B202E69F37EAF63F9FD98 | D-AC4CC429F7F3D5489E14522B | V | Belgisch | Verdachte |

Table 2.5: Pseudonymized victim preview (first 5)

| PV Pseudonym | Person Pseudonym | Gender | Nationality | Role |
| --- | --- | --- | --- | --- |
| PV-BDD76BBF3F1C5F8909C5DD9D | S-F71BABA98A2E1DA1A8D494C6 | M | Duits | Slachtoffer |
| PV-C19BF58E43D7145B7155580B | S-ACF088B1C241F2BD6C561B79 | M | Pools | Slachtoffer |
| PV-661E861336D7E15104CD9F69 | S-AB252BC3C6E5F0B6B00863D2 | M | Frans | Slachtoffer |
| PV-B87EB997C7140CFAA42EE97A | S-7DF963006DA72F97BF4CF224 | M | Duits | Slachtoffer |
| PV-7053CD96B4018C7075F2DC80 | S-3D6FEB2A6CA2AB892563603F | M | Duits | Slachtoffer |

Tables 2.4 and 2.5 should be interpreted as structure checks: they show that the research files still contain the fields needed for analysis and linkage, but no longer expose direct personal identifiers.

2.4. Cross-Dataset Linkage Integrity Verification

The core claim of consistent keyed pseudonymization is that all analytical joins work identically before and after pseudonymization. Table 2.5b summarises this as a scorecard across three join types demonstrated in Sections 2.4.1–2.4.3:

  1. Offender → Incident (Table 2.6): offender-level attributes — pseudonymous person ID, age group, nationality — linked to case-level incident data via the shared PV pseudonym (the pseudonymized case number).
  2. Cross-role person linkage (Table 2.7): the same individual appearing as offender in one case and as victim in another, traced via the universal person pseudonym without revealing their identity.
  3. Repeat offender history (Table 2.8): all case records attributed to one pseudonymous person, assembled in chronological order for longitudinal analysis.

This matters not only for longitudinal or cross-file analysis, but also for network analysis and intelligence-led or forensic insight, because stable person- and case-level pseudonyms allow co-involvement, repeat contacts, and event relationships to be reconstructed without exposing direct identifiers.

Table 2.5b: Linkage integrity scorecard

| Join type | Raw data | After pseudonymization | Preserved? |
| --- | --- | --- | --- |
| Incidents raw count | 120 | 120 | 100% (ok) |
| Unique PV numbers (incidents) | 120 | 120 | 100% (ok) |
| Unique offender persons (by RRN/pseudonym) | 42 | 42 | 100% (ok) |
| Offender → Incident joins (PV number) | 85 | 85 | 100% (ok) |
| Offender → Victim joins (same person) | 4 | 4 | 100% (ok) |

The scorecard should be read by comparing each join count before and after pseudonymization. The Raw data column uses real identifiers such as RRN and PV number, whereas the After pseudonymization column uses only the derived pseudonyms. Identical counts show that the intended joins are preserved after direct identifiers are transformed. That result demonstrates linkage integrity, but it does not imply that the release is risk-free: quasi-identifier risk remains and is evaluated in the next sections.
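The before-and-after comparison behind such a scorecard can be sketched as below. All records, the key, and the helper names `pseud` and `join_counts` are invented for illustration; the point is only that the same counting function is run once on raw identifiers and once on their pseudonyms.

```python
import hashlib
import hmac

KEY = b"hypothetical-demo-key"

def pseud(value, prefix):
    """Keyed pseudonym in the report's token format (prefix plus 24 hex characters)."""
    digest = hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:24].upper()}"

# Toy records as (person identifier, PV number).
offenders = [("P001", "2022/GNT/00001"), ("P002", "2022/GNT/00002"), ("P001", "2023/GNT/00003")]
victims = [("P003", "2022/GNT/00001"), ("P002", "2023/GNT/00003")]
incidents = ["2022/GNT/00001", "2022/GNT/00002", "2023/GNT/00003", "2023/GNT/00004"]

def join_counts(offenders, victims, incidents):
    """Count the joins that the release must preserve."""
    incident_set = set(incidents)
    return {
        "unique_pv": len(incident_set),
        "unique_offenders": len({person for person, _ in offenders}),
        "offender_incident": sum(pv in incident_set for _, pv in offenders),
        "cross_role_persons": len({p for p, _ in offenders} & {p for p, _ in victims}),
    }

raw = join_counts(offenders, victims, incidents)
released = join_counts(
    [(pseud(p, "PRS"), pseud(pv, "PV")) for p, pv in offenders],
    [(pseud(p, "PRS"), pseud(pv, "PV")) for p, pv in victims],
    [pseud(pv, "PV") for pv in incidents],
)
```

Equal counts in `raw` and `released` are the property the scorecard reports; any mismatch would point to a collision or to inconsistent input formatting.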

2.4.1 Offender–Incident Linkage via PV Pseudonym

Table 2.6 shows the simplest preserved join in the release pipeline: offender attributes remain linkable to incident-level case information through the shared PV pseudonym. Two key column labels require clarification: the PV pseudonym is the pseudonymized case identifier (derived from the original PV number and stable across all files); the Person pseudonym is the pseudonymized offender identifier (derived from the RRN via HMAC-SHA256). Neither field contains any direct personal information — they serve solely as stable, researcher-safe linkage keys.

Table 2.6: Offender–incident link (sample rows)

| PV pseudonym | Person pseudonym | Gender | Age group | Nationality | Prior PVs | Known to police | Date | Offence type | Neighbourhood | Injury |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PV-841A5F319D8B609FB41C7E8C | PRS-F23B9BA25B6EC33EF0F3BD2C | V |  | Duits | 0 | Nee | 2021-07-01 | Property crime | Muide | Zwaar |
| PV-85A517390E8F41928753A119 | PRS-523BA24A18253BDFC7A8C87D | M |  | Congolees | 1 | Ja | 2022-07-28 | Public order / substance | Sint-Amandsberg | Geen |
| PV-7E79B32DE9FE421A4CA45E08 | PRS-1895C0DB1C22DB37728E97CE | M |  | Roemeens | 0 | Nee | 2022-02-18 | Public order / substance | Sint-Amandsberg | Geen |
| PV-01FB2838F77F4CBDEAD4396E | PRS-80C0220A9292AD1AE3170D97 | V |  | Nederlands | 0 | Nee | 2023-11-21 | Public order / substance | Bloemekenswijk | Licht |
| PV-CE1B202E69F37EAF63F9FD98 | PRS-AC4CC429F7F3D5489E14522B | V |  | Belgisch | 1 | Ja | 2021-08-31 | Property crime | Muide | Geen |
| PV-B87EB997C7140CFAA42EE97A | PRS-4700C7A200C8AB703BE079FA | M |  | Congolees | 1 | Nee | 2022-12-23 | Public order / substance | Sint-Amandsberg | Licht |

2.4.2 Cross-Role Person Linkage: Offender and Victim Datasets

As noted in §2.1.3, some persons appear in both the offender and victim registers across different cases — the cross-role pattern characteristic of interpersonal violence data. Table 2.7 confirms that the universal person pseudonym preserves this linkage after pseudonymization: researchers can trace the same individual across roles without ever knowing who that person is.

Table 2.7: Cross-role linkage via universal pseudonym

| Person pseudonym | PV pseudonym (offender role) | Role (offender) | PV pseudonym (victim role) | Role (victim) | Relation to suspect |
| --- | --- | --- | --- | --- | --- |
| PRS-64439954BA3B5F3509FF73C6 | PV-8CAE09F42E9D407D204ED698 | Verdachte | PV-ED87DE7C95D7C998D7074D5D | Slachtoffer | Onbekend |
| PRS-D2811785D27F7368CAFE362B | PV-EF8DB40B581698D0B476B36E | Verdachte | PV-24464F55C28975FB5AA79987 | Slachtoffer | Onbekend |
| PRS-D2811785D27F7368CAFE362B | PV-EF8DB40B581698D0B476B36E | Verdachte | PV-23FFB7E4AD3DF5B710508392 | Slachtoffer | Onbekend |
| PRS-1E4DC75EB6FF9F17A9B1FE74 | PV-6DFFC4C792512AC2BEF2DDF5 | Verdachte | PV-21358D6B89AD66EE1E0C7406 | Slachtoffer | Partner |

2.4.3 Repeat Offender Tracking Across Multiple PV Records

Table 2.8 extends the same logic over time: it shows that repeat involvement across multiple PV records can still be reconstructed for one pseudonymous individual, which is essential for longitudinal offending analysis.

Table 2.8: Criminal history of offender PRS-12DC66B62F83CA4CE2DDE998 — linked via universal pseudonym

| Person pseudonym | PV pseudonym | Date | Offence type | Neighbourhood | Gender | Age group | Nationality | Prior PVs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PRS-12DC66B62F83CA4CE2DDE998 | PV-795D706A47D6B9B8313E718D | 2021-05-23 | Violence / interpersonal | Muide | M |  | Belgisch | 0 |
| PRS-12DC66B62F83CA4CE2DDE998 | PV-661E861336D7E15104CD9F69 | 2021-08-01 | Violence / interpersonal | Sint-Amandsberg | M |  | Belgisch | 3 |
| PRS-12DC66B62F83CA4CE2DDE998 | PV-CC570D1B8B80400F821D559F | 2021-08-02 | Violence / interpersonal | Gentbrugge | M |  | Belgisch | 3 |
| PRS-12DC66B62F83CA4CE2DDE998 | PV-E871AF0588D26888F08E6816 | 2023-08-19 | Violence / interpersonal | Mariakerke | M |  | Belgisch | 1 |
| PRS-12DC66B62F83CA4CE2DDE998 | PV-7B70B46C910CA605295A4B82 | 2023-09-02 | Property crime | Ledeberg | M |  | Belgisch | 1 |
| PRS-12DC66B62F83CA4CE2DDE998 | PV-A9921AD699D6DCB60A9DA459 | 2023-11-13 | Public order / substance | Gentbrugge | M |  | Belgisch | 3 |
| PRS-12DC66B62F83CA4CE2DDE998 | PV-9A654E2DCB93B08E64436258 | 2024-04-04 | Property crime | Gentbrugge | M |  | Belgisch | 2 |
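Assembling such a history from a released file needs nothing beyond the person pseudonym and the date. The sketch below uses invented, shortened pseudonyms and assumes dates are stored as ISO 8601 strings, as in the tables of this report.

```python
from collections import defaultdict

# Toy released records as (person pseudonym, PV pseudonym, ISO date).
records = [
    ("PRS-AAA", "PV-003", "2023-08-19"),
    ("PRS-BBB", "PV-004", "2022-01-10"),
    ("PRS-AAA", "PV-001", "2021-05-23"),
    ("PRS-AAA", "PV-002", "2021-08-01"),
]

# Group case records by pseudonymous person.
histories = defaultdict(list)
for person, pv, date in records:
    histories[person].append((date, pv))

# ISO 8601 dates sort chronologically as plain strings.
for person in histories:
    histories[person].sort()

# Persons with more than one linked case are the repeat offenders.
repeat_offenders = {p: h for p, h in histories.items() if len(h) > 1}
```

No direct identifier is involved at any step: the grouping key is the pseudonym itself, so the researcher can build the sequence without knowing who the person is.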


2.5. Re-identification Risk Analysis

Even after direct identifiers are removed, a combination of quasi-identifiers (age group, nationality, marital status, and gender) can single out an individual. Sweeney (2000) estimated that 87% of the US population could be uniquely identified by ZIP code, date of birth, and sex alone.

2.5.1 Uniqueness by quasi-identifier combination

Table 2.9 reports quasi-identifier uniqueness in the raw files — before any pseudonymization — to establish a risk baseline. Table 2.10 reports the same metric in the pseudonymized files. The values are expected to be identical across both tables: pseudonymization replaces direct identifiers (names, RRN, PV numbers) but leaves quasi-identifiers — age group, nationality, marital status, and gender — unchanged. The comparison therefore confirms that pseudonymization alone does not eliminate re-identification risk through quasi-identifier combinations, which is why the k-anonymity suppression step in Section 2.5.3 remains necessary.

Table 2.9: Quasi-identifier risk (raw files)

| Dataset | Records | Distinct combos | Unique records (n) | % unique | Small group (n) | % small group |
| --- | --- | --- | --- | --- | --- | --- |
| Offenders (raw) | 85 | 32 | 10 | 12% | 44 | 52% |
| Victims (raw) | 110 | 46 | 12 | 11% | 74 | 67% |

Table 2.10: Quasi-identifier risk (after pseudonymization)

| Dataset | Records | Distinct combos | Unique records (n) | % unique | Small group (n) | % small group |
| --- | --- | --- | --- | --- | --- | --- |
| Offenders (pseudonymized) | 85 | 32 | 10 | 12% | 44 | 52% |
| Victims (pseudonymized) | 110 | 46 | 12 | 11% | 74 | 67% |

Tables 2.9 and 2.10 should be read comparatively. The percentages are record-level shares, not shares of distinct combination types. The identical before/after figures confirm what the section introduction above explains: pseudonymization does not alter the quasi-identifiers used in the risk calculation.
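The record-level shares can be computed with a simple group count. In the sketch below the records and category labels are invented, and the definition of a small group as fewer than 5 records is an assumption made to match the k = 5 threshold used later; the report does not state the cut-off behind the "Small group" columns.

```python
from collections import Counter

# Toy records as (age group, nationality, marital status, gender).
records = [
    ("18-24", "Belgisch", "Ongehuwd", "M"),
    ("18-24", "Belgisch", "Ongehuwd", "M"),
    ("25-34", "Duits", "Gehuwd", "V"),
    ("35-44", "Pools", "Ongehuwd", "M"),
    ("18-24", "Belgisch", "Ongehuwd", "M"),
]

def uniqueness_report(records, small_group_below=5):
    """Summarise how many records sit in unique or small quasi-identifier groups."""
    group_size = Counter(records)
    n = len(records)
    unique = sum(1 for r in records if group_size[r] == 1)
    small = sum(1 for r in records if group_size[r] < small_group_below)
    return {
        "records": n,
        "distinct_combos": len(group_size),
        "unique_n": unique,
        "pct_unique": round(100 * unique / n),
        "small_group_n": small,
        "pct_small_group": round(100 * small / n),
    }

report = uniqueness_report(records)
```

Since the function reads only quasi-identifier columns, running it on raw and pseudonymized files necessarily gives the same result, which is the pattern Tables 2.9 and 2.10 show.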

2.5.2 k-Anonymity check per nationality/age group cell

That residual uniqueness risk is why the k-anonymity check below is still needed. Figure 2.1 visualises the cell-level risk distribution by age group, making it possible to see at a glance which age groups are most exposed before suppression is applied.

Figure 2.1: k-anonymity risk by age group

2.5.3 k-Anonymity Suppression (k >= 5 Threshold)

k-Anonymity requires that every record in a release file shares its quasi-identifier combination with at least k-1 other records, making individual re-identification at most a 1-in-k probability (Sweeney, 2002). To reduce the privacy risk created by raw country labels while retaining interpretability, nationality is released in grouped categories rather than being suppressed immediately. The main release scheme is Belgian / EU (non-Belgian) / Non-EU, with a fallback to Belgian / Non-Belgian when the 3-group cell still falls below the chosen threshold in this small synthetic sample. Table 2.11 defines the release categories, their source label coverage, and the privacy and analytical rationale for each.

Table 2.11: Nationality release grouping (recommended)

| Release category | Example source labels | Privacy rationale | Analytical usefulness | Recommended for small police datasets |
| --- | --- | --- | --- | --- |
| Belgian | Belgisch | Keeps the domestic reference category while removing country-level specificity. | Preserves the key contrast between Belgian and non-Belgian records. | Yes |
| EU (non-Belgian) | Nederlands, Frans, Duits, Italiaans, Pools, Portugees, Roemeens, Spaans | Collapses several country labels into a broader region, reducing uniqueness from rare EU nationalities. | Retains a meaningful European mobility category without exposing exact country labels. | Yes |
| Non-EU | Congolees, Marokkaans, Turks | Absorbs the highest-risk rare-country labels into one broad release category. | Preserves a coarse but policy-relevant distinction for descriptive analysis. | Yes |
| Fallback: Non-Belgian | Applied when the 3-group cell is still below k | Further reduces small-cell risk in very small police datasets. | Provides a pragmatic fallback when the 3-group scheme remains too sparse. | Yes |

Table 2.12: Release summary after suppression

| Dataset | Released in 3 groups (n) | % recoded to 3 groups | Fallback to 2 groups (n) | % fallback 2-group | Age group suppressed (n) | % age suppressed |
| --- | --- | --- | --- | --- | --- | --- |
| Offenders | 54 | 63.5 | 31 | 36.5 | 3 | 3.5 |
| Victims | 64 | 58.2 | 46 | 41.8 | 1 | 0.9 |

Table 2.12 quantifies the practical cost of the final release logic. In small police datasets, grouped nationality categories often retain more analytical meaning than blanket suppression, but some records still need a fallback to Belgian versus non-Belgian and some age groups still need suppression when the corresponding age-by-gender cell remains below the threshold.
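The grouping-with-fallback logic can be sketched as follows. Several details are assumptions rather than statements from the report: the cell is taken to be release category by age group, any label outside the listed EU set is treated as Non-EU, and the sketch does not re-check cell sizes after the fallback or perform the age suppression step.

```python
from collections import Counter

# EU (non-Belgian) source labels as listed in Table 2.11.
EU_NON_BELGIAN = {"Nederlands", "Frans", "Duits", "Italiaans", "Pools",
                  "Portugees", "Roemeens", "Spaans"}

def three_group(nationality):
    """Map a source label to the 3-category scheme."""
    if nationality == "Belgisch":
        return "Belgian"
    return "EU (non-Belgian)" if nationality in EU_NON_BELGIAN else "Non-EU"

def two_group(nationality):
    """Map a source label to the 2-category fallback scheme."""
    return "Belgian" if nationality == "Belgisch" else "Non-Belgian"

def release_nationality(records, k=5):
    """records: list of (nationality, age_group). Returns one release label per record."""
    cells = Counter((three_group(nat), age) for nat, age in records)
    return [
        three_group(nat) if cells[(three_group(nat), age)] >= k else two_group(nat)
        for nat, age in records
    ]

records = (
    [("Belgisch", "25-34")] * 5
    + [("Frans", "25-34")] * 3
    + [("Duits", "25-34")] * 2
    + [("Turks", "25-34")] * 2
)
released = Counter(release_nationality(records, k=5))
```

In this toy sample the two Non-EU records fall below k = 5 and are released as Non-Belgian, while the Belgian and EU cells meet the threshold and keep their 3-category labels.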

2.6. End-to-End Research Workflow

This section follows the complete release as a researcher receives it. Researchers have no access to real identifiers, only to pseudonymous IDs and generalised attributes, yet they can still perform full longitudinal and cross-dataset analysis.

Figure 2.2 shows the two concrete transformations the pipeline applies to offender records. The top row compares raw country-level nationality labels (Panel A) against the release categories (Panel B): most records are released under the 3-category system (Belgian / EU / Non-EU, shown in blue); records whose 3-category cell fell below k=5 are recoded to the 2-category fallback (Belgian / Non-Belgian, shown in orange). The bottom row restricts to records that had a valid age before suppression and shows which age-gender cells survive intact (Panel C) and which fall below the k=5 threshold and are withheld from the release (Panel D, grey bars). Figure 2.3 then confirms that neighbourhood-level cross-tabulation remains meaningful after the same protections are applied.

Figure 2.2: Nationality grouping and age suppression — before vs after pipeline (offender records)


We preserve neighbourhood names as analytical attributes while all person and case identifiers are pseudonymized. The heatmap below shows whether spatial cross-tabulation still works after these protections.

Figure 2.3: Offence type by neighbourhood (post-suppression)


Taken together, Figures 2.2 and 2.3 show that the release remains suitable for standard descriptive and exploratory analysis across age, offence type, and neighbourhood, even though direct identifiers have been removed and some attributes have been generalised.
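To make the researcher's side of the workflow concrete, the sketch below joins an offender file to an incident file on a pseudonymous case ID and then cross-tabulates offence type by neighbourhood. All field names (`person_pid`, `case_pid`, `offence`, `neighbourhood`) and values are hypothetical placeholders, not the report's actual file layout.

```python
from collections import Counter, defaultdict

# Hypothetical researcher-facing rows: pseudonymous IDs and generalised
# attributes only, with no names, RRNs, or PV numbers.
incidents = [
    {"case_pid": "c01", "offence": "theft", "neighbourhood": "A"},
    {"case_pid": "c02", "offence": "theft", "neighbourhood": "A"},
    {"case_pid": "c03", "offence": "assault", "neighbourhood": "B"},
]
offenders = [
    {"person_pid": "p01", "case_pid": "c01", "age_group": "18-24"},
    {"person_pid": "p01", "case_pid": "c03", "age_group": "18-24"},
    {"person_pid": "p02", "case_pid": "c02", "age_group": "25-34"},
]


def link(offender_rows, incident_rows):
    """Inner join of offender rows to incident rows on the pseudonymous case ID."""
    by_case = {inc["case_pid"]: inc for inc in incident_rows}
    return [
        {**off, **by_case[off["case_pid"]]}
        for off in offender_rows
        if off["case_pid"] in by_case
    ]


linked = link(offenders, incidents)

# Cross-tabulation: offence type by neighbourhood.
crosstab = Counter((r["offence"], r["neighbourhood"]) for r in linked)

# Longitudinal view: the distinct cases attached to each pseudonymous person.
cases_by_person = defaultdict(set)
for r in linked:
    cases_by_person[r["person_pid"]].add(r["case_pid"])

print(dict(crosstab))
print({pid: sorted(cases) for pid, cases in cases_by_person.items()})
```

Because the pseudonyms are deterministic, the same person or case carries the same ID in every file, which is what lets the join and the per-person grouping work without any real identifier.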

2.7. Residual Risks: Checks That Pseudonymization Alone Cannot Address

Sections 2.3–2.6 demonstrate that deterministic keyed pseudonymization solves the linkage problem and eliminates direct identifiers. However, two categories of risk remain after pseudonymization is complete and require separate treatment before a research extract can be safely released. They are not deficiencies of the pipeline: they are inherent to any structured data release and must be addressed at the release-preparation stage.

2.7.1 Residual Risk 1: Temporal Precision and Date Generalisation

Exact timestamps can become identifying when combined with offence category, neighbourhood, or person-level attributes — even when all direct identifiers have been removed. This risk is structural: it arises from the precision of the data itself, not from any failure of the pseudonymization step.

Table 2.13: Temporal precision vs. re-identification risk

| Precision level | Unique combinations | Total records | % unique | Re-id risk |
| --- | --- | --- | --- | --- |
| Exact (date + HH:MM) | 119 | 120 | 99.2 | High |
| Date + hour of day (24) | 118 | 120 | 98.3 | High |
| Date + time band (4) | 117 | 120 | 97.5 | High |
| Week + time band | 108 | 120 | 90.0 | High |
| Month + time band | 87 | 120 | 72.5 | High |

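Table 2.13 should be interpreted as a trade-off table rather than a fixed rule. Its purpose is to show how uniqueness responds as temporal precision is coarsened: in this 120-record sample, the share of unique combinations falls from 99.2% at exact timestamps to 72.5% at month plus time band, so every level shown is still rated high risk. A data controller can use a table of this kind to justify releasing broader time bands instead of exact timestamps.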
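A table like 2.13 can be produced by coarsening each timestamp to a given precision level and counting the distinct values that remain. The sketch below assumes that "unique combinations" means distinct combinations. The four six-hour time bands and the `timestamp` field name are likewise assumptions, since the report does not state its band boundaries in this section.

```python
from datetime import datetime


def time_band(hour):
    """Assumed split of the day into four six-hour bands."""
    return ("night", "morning", "afternoon", "evening")[hour // 6]


# Each level maps a datetime to a coarser key.
LEVELS = {
    "exact": lambda t: t.strftime("%Y-%m-%d %H:%M"),
    "date+hour": lambda t: t.strftime("%Y-%m-%d %H"),
    "date+band": lambda t: (t.date(), time_band(t.hour)),
    "week+band": lambda t: (tuple(t.isocalendar())[:2], time_band(t.hour)),
    "month+band": lambda t: ((t.year, t.month), time_band(t.hour)),
}


def uniqueness_by_level(records, extra_keys=()):
    """Return {level: (distinct combinations, % of records)} for each level.

    extra_keys names other attributes (for example offence type) that are
    combined with the coarsened timestamp when counting combinations.
    """
    result = {}
    for name, coarsen in LEVELS.items():
        combos = {
            (coarsen(r["timestamp"]),) + tuple(r[k] for k in extra_keys)
            for r in records
        }
        result[name] = (len(combos), round(100 * len(combos) / len(records), 1))
    return result


# Synthetic demo timestamps (hypothetical values).
demo_events = [
    {"timestamp": datetime(2022, 3, 15, 23, 45)},
    {"timestamp": datetime(2022, 3, 15, 23, 10)},
    {"timestamp": datetime(2022, 3, 16, 2, 30)},
    {"timestamp": datetime(2022, 3, 28, 1, 0)},
]
demo_uniqueness = uniqueness_by_level(demo_events)
for level, (n, pct) in demo_uniqueness.items():
    print(level, n, pct)
```

Passing `extra_keys` lets the same function measure how quickly uniqueness returns once the coarsened time is combined with offence category or neighbourhood.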

2.7.2 Residual Risk 2: Narrative Field Handling and Free-Text

Structured pseudonymization is not enough when police records include narrative text. Names, RRNs, addresses, and times often remain embedded in natural language fields and therefore require a separate redaction step before a research extract can be released. Unlike temporal precision, which can be addressed by coarsening a date field, free-text risk requires entity recognition: the system must detect that a phrase is a name or an address before it can redact it. Table 2.14 illustrates this transformation on four synthetic Dutch-language police narratives.

Table 2.14: Free-text narrative anonymization (raw vs redacted)

| Record ID | Raw narrative | Redacted narrative |
| --- | --- | --- |
| PV-A1B2 | Op 15/03/2022 om 23:45 werd Jan De Smedt, RRN 86.04.12-234.71, wonende Langestraat 42, 9000 Gent, aangetroffen. | Op [DATUM] om [TIJD] werd [NAAM], RRN [RRN], wonende [ADRES], [POSTCODE], aangetroffen. |
| PV-C3D4 | Het voertuig werd bestuurd door Mohamed El Amrani, geboortedatum 04/04/1990, RRN 90.04.04-123.45, uit Merelbeke. | Het voertuig werd bestuurd door [NAAM], geboortedatum [DATUM], RRN [RRN], uit Merelbeke. |
| PV-E5F6 | Slachtoffer Emma Vandenberghe (geb. 12-06-1978), verblijvend te Veldstraat 18, Gent, deed aangifte van diefstal. | [NAAM] (geb. [DATUM]), verblijvend te [ADRES], Gent, deed aangifte van diefstal. |
| PV-G7H8 | Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om 02:30. | Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om [TIJD]. |

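The regex example is only a baseline used to illustrate the transformation logic. In practice, entity detection in free text needs a stronger NLP layer because names, addresses, and times appear in many formats that simple patterns do not capture reliably. Table 2.14 already shows the limits: Gent is redacted in PV-A1B2 but survives in PV-E5F6, and Merelbeke is left untouched in PV-C3D4.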
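A regex baseline of this kind might look like the sketch below, which covers only the three rigidly formatted entities (RRNs, dates, and times) and reuses the placeholder tokens from Table 2.14. It is not the report's own pattern set. Names and addresses are deliberately left out because they are exactly the entities that need a proper recognition step.

```python
import re

# Order matters: the RRN pattern runs first so that later patterns
# never see its digits.
PATTERNS = [
    # Belgian national register number written as YY.MM.DD-XXX.CC
    (re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b"), "[RRN]"),
    # Dates such as 15/03/2022 or 12-06-1978
    (re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b"), "[DATUM]"),
    # 24-hour clock times such as 23:45 or 02:30
    (re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b"), "[TIJD]"),
]


def redact(text):
    """Replace RRNs, dates, and times with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


sample = ("Het voertuig werd bestuurd door Mohamed El Amrani, "
          "geboortedatum 04/04/1990, RRN 90.04.04-123.45, uit Merelbeke.")
print(redact(sample))
```

On the synthetic PV-C3D4 narrative, this sketch redacts the birth date and RRN but leaves the name and the municipality in place, which is the gap an entity-recognition layer would have to close.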

2.8. Part II Summary

  • Deterministic HMAC-SHA256 pseudonymization preserves all intended joins across offender, victim, and incident files; the linkage scorecard shows 100% preservation in this synthetic demo.
  • Canonicalization of PV numbers and RRNs prevents formatting differences from breaking cross-dataset linkage.
  • Quasi-identifier risk remains after pseudonymization; grouped nationality plus age suppression (k >= 5) reduces small-cell risk while keeping analytical value.
  • Researcher-facing outputs are limited to four pseudonymized files (offender, victim, incident, and linked records); the key mapping table is kept separately in a secure folder accessible only to the data controller and is never released.
  • For this prototype a synthetic demo key can be used when ALLOW_DEMO_KEY=true is set; in production the key must be supplied by the data controller from a secure vault or HSM.
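The first, second, and last points above can be illustrated with Python's standard `hmac` and `hashlib` modules. This is a sketch under stated assumptions, not the pipeline's code: the `PSEUDO_KEY` variable name, the PV canonicalization rule (upper-case, whitespace removed), the demo key value, and the domain prefix that keeps RRN and PV pseudonyms in separate namespaces are all hypothetical. Only `ALLOW_DEMO_KEY=true` comes from the report.

```python
import hashlib
import hmac
import os
import re


def canonical_rrn(raw):
    """Reduce an RRN to its 11 digits so formatting variants hash identically."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 11:
        raise ValueError("an RRN must contain exactly 11 digits")
    return digits


def canonical_pv(raw):
    # Assumption: PV numbers are equal once upper-cased with whitespace removed;
    # the real PV format is not specified in this section.
    return re.sub(r"\s+", "", raw).upper()


def load_key(env=os.environ):
    """Fetch the secret key; allow a demo key only when explicitly enabled."""
    key = env.get("PSEUDO_KEY")  # hypothetical variable name
    if key:
        return key.encode("utf-8")
    if env.get("ALLOW_DEMO_KEY", "").lower() == "true":
        return b"synthetic-demo-key-not-for-production"
    raise RuntimeError("no pseudonymization key supplied by the data controller")


def pseudonymize(value, key, domain):
    """Deterministic keyed pseudonym: HMAC-SHA256 over 'domain:value'."""
    message = f"{domain}:{value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


demo_key = load_key({"ALLOW_DEMO_KEY": "true"})
pid_a = pseudonymize(canonical_rrn("86.04.12-234.71"), demo_key, "rrn")
pid_b = pseudonymize(canonical_rrn("86041223471"), demo_key, "rrn")
print(pid_a == pid_b)
```

Because the same canonical input and key always yield the same digest, two files pseudonymized separately can still be joined, while anyone without the key cannot recompute or reverse the mapping.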

3. Regulatory Compliance, Methodological Constraints, and Researcher Qualifications

3.2. Methodological Constraints in Police-Data Anonymization

The report does not demonstrate every operational challenge of a full police-data release system, but it does show three constraints that remain central in practice. First, spatial masking does not affect every criminological analysis in the same way: in Part I the log-distance and boundary-free 1 km familiarity terms are more sensitive to point displacement than several ward-level opportunity covariates, so masking has to be validated against the target analysis rather than chosen abstractly. Second, linked files remain usable only when pseudonymization is deterministic, because joins fail if key use or identifier formatting varies across extracts. Third, even after direct identifiers are removed, linked researcher files can still contain rare quasi-identifier combinations, which is why suppression and coarsening remain necessary before release.

3.3. Researcher Competencies and the Path Forward

This report demonstrates the technical foundations of a privacy-preserving pipeline for Belgian police data. The three areas below describe not only the competencies brought to bear in this demonstration, but specifically how I plan to carry the project forward into an operational data release system for the UGent Crime Lab.

Spatial criminology, scale, and model sensitivity: The crime location choice framework applied in Part I will guide masking decisions in the operational pipeline: because spatial masking affects different covariates differently, each new release will need to be validated against the target analysis rather than evaluated in the abstract. I plan to extend this validation framework to additional offence types and to Belgian spatial units, building a masking-quality evidence base that is directly reusable across research projects within the Crime Lab.

Linked-data handling and pseudonymization logic: The pseudonymization pipeline built in Part II is designed for extension. The next steps are to integrate the canonical PV and RRN wrappers into the data controller's extract workflow, define key rotation procedures per researcher batch, and test the pipeline against realistic multi-zone Belgian police extracts. The goal is a documented, repeatable release procedure that any police zone data manager can follow without specialist programming knowledge.

Reproducible implementation and risk auditing: The risk audit framework (quasi-identifier analysis, k-anonymity checks, suppression logging) will be packaged as a standalone module (implemented in R or Python, depending on the operational environment) that the data controller can run before each release. I plan to document the module so that it satisfies the GDPR Art. 30 processing register obligation and can be presented to a Data Protection Officer as structured evidence of the technical safeguards in place.

Security note on key mapping table: A mapping between real identifiers (RRN, PV numbers) and their pseudonyms has been written to a secure folder accessible only to the data controller. This file is for police system use only. It must never be shared with researchers or stored in the research environment. In production, this table would be held in the data controller's HSM or certified key vault, satisfying GDPR Art. 32.

All names, RRN numbers, and case details in this section are fully synthetic. No real persons are represented.

References

Belgian Federal Government. (1992). Wet op het Politieambt [Belgian Police Act]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/

Belgian Federal Government. (2018). Wet van 30 juli 2018 betreffende de bescherming van natuurlijke personen met betrekking tot de verwerking van persoonsgegevens [Law of 30 July 2018 on the protection of natural persons with regard to the processing of personal data]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/

Bernasco, W., & Nieuwbeerta, P. (2005). How do residential burglars select target areas? A new approach to the analysis of criminal location choice. British Journal of Criminology, 45(3), 296-315. https://doi.org/10.1093/bjc/azh070

European Union Agency for Cybersecurity. (2019). Pseudonymisation techniques and best practices: Recommendations on shaping technology according to data protection and privacy provisions. Publications Office of the European Union. https://www.enisa.europa.eu/publications/pseudonymisation-techniques-and-best-practices

European Parliament and Council of the European Union. (2016a). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88. https://eur-lex.europa.eu/eli/reg/2016/679/oj

European Parliament and Council of the European Union. (2016b). Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data. Official Journal of the European Union, L 119, 89-131. https://eur-lex.europa.eu/eli/dir/2016/680/oj

Fotheringham, A. S., & Wong, D. W. S. (1991). The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A, 23(7), 1025-1044. https://doi.org/10.1068/a231025

Kuralarasan, K., & Bernasco, W. (2022). Location choice of snatching offenders in Chennai City. Journal of Quantitative Criminology, 38, 673-696. https://doi.org/10.1007/s10940-021-09514-9

Openshaw, S. (1984). The modifiable areal unit problem (Concepts and Techniques in Modern Geography No. 38). Geo Books.

Song, G., Bernasco, W., Liu, L., Xiao, L., Zhou, S., & Liao, W. (2019). Crime feeds on legal activities: Daily mobility flows help to explain thieves' target location choices. Journal of Quantitative Criminology, 35(4), 831-854. https://doi.org/10.1007/s10940-019-09406-z

Steenbeek, W., & Weisburd, D. (2016). Where the action is in crime? An examination of variability of crime across different spatial units in The Hague, 2001-2009. Journal of Quantitative Criminology, 32(3), 449-469. https://doi.org/10.1007/s10940-015-9276-3

Sweeney, L. (2000). Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. http://dataprivacylab.org/projects/identifiability/paper1.pdf

Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570. https://doi.org/10.1142/S0218488502001648

Weisburd, D., Groff, E. R., & Yang, S.-M. (2012). The criminology of place: Street segments and our understanding of the crime problem. Oxford University Press.

© 2026 Dr. Kuralarasan Kumar. This document and its methodology are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). All synthetic data used in Part II are fully artificial; no real persons are represented.