Scope of the report

In criminological research using crime datasets and related administrative records, two disclosure problems usually have to be addressed separately: sensitive locations and identifiers that support linkage. This document combines two linked demonstrations that address each in turn:

  1. Part I: how alternative masking methods change a real Chennai crime location choice model when compared with the unmasked baseline
  2. Part II: how consistent HMAC-SHA256 pseudonymization preserves cross-dataset linkage in a Belgian police-style linked-data workflow

Part I addresses spatial masking and analytical distortion. Part II addresses identifier protection and cross-file linkage.

Part I presents an empirical Chennai application based on the output set in outputs/chennai_masking_alternatives/prior_within_1km/. The main specification operationalizes prior spatial familiarity with the boundary-free term prior_crime_within_1km_any and compares masked estimates against the unmasked baseline of that same specification. A reduced specification that drops the prior-location term entirely is retained as an appendix robustness check.

Part II (Sections 1--8 of the pseudonymization chapter) presents the identifier-management side of the same pipeline. Three linked synthetic administrative datasets -- incident register, offender records, victim records -- were structured as Belgian police-style linked files, drawing on systems such as ISLP and related police databases. Direct identifiers such as name, RRN, address, and PV number were replaced using keyed pseudonymization. The section then shows whether the intended joins remain valid, whether quasi-identifier risk can be detected and suppressed through k-anonymity, and how temporal and free-text fields can be handled within the same workflow.


Part I: Chennai Real-Data Spatial Masking Analysis

Part I presents a real-data Chennai masking analysis based on the published snatching location choice study (Kuralarasan & Bernasco, 2022). The underlying script is scripts/chennai_alternative_masking_analysis.R and the output files are in outputs/chennai_masking_alternatives/. Because this is a real-data setting, the correct benchmark is the unmasked baseline model, not a set of known true parameters.

How Part I should be read: The masking methods are compared against the unmasked Chennai baseline of the same specification. Lower RMSE means the masked model remains closer to the original fitted result. In the main Chennai specification, both crime points and offender-home points are masked, crime wards are reassigned after masking, and prior spatial familiarity is measured with prior_crime_within_1km_any rather than a same-ward indicator.

1. Chennai baseline and comparison logic

The Chennai analysis uses the real snatching data and an adapted published-style Model 2 structure used in scripts/chennai_alternative_masking_analysis.R: distance, prior_crime_within_1km_any, area, population, and the ward-level opportunity covariates. This familiarity measure is used instead of the original same-ward prior-crime indicator so that the model is less dependent on arbitrary administrative boundaries once crime and home points are spatially masked. The baseline model is the unmasked conditional logit model of that same specification. Alternative grid and geomasking methods are then compared against that baseline. The summary table below ranks the tested methods by overall deviation from the unmasked result.
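To make the two families of masking methods concrete, the following minimal Python sketch shows one common way to implement each. It is an illustration under stated assumptions, not the code in scripts/chennai_alternative_masking_analysis.R: it assumes that a grid method snaps each point to the centre of its grid cell, that a geomask displaces each point by a random distance between a minimum and a maximum radius, and that coordinates are projected and in metres. The function names and the example coordinates are hypothetical.

```python
import math
import random


def grid_mask(x, y, cell_size):
    """Snap a projected point (metres) to the centre of its grid cell."""
    cx = (math.floor(x / cell_size) + 0.5) * cell_size
    cy = (math.floor(y / cell_size) + 0.5) * cell_size
    return cx, cy


def donut_geomask(x, y, r_min, r_max, rng):
    """Displace a point by a random distance in [r_min, r_max] in a random direction."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # Sampling the squared radius uniformly spreads points evenly over the ring's area.
    radius = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
    return x + radius * math.cos(angle), y + radius * math.sin(angle)


rng = random.Random(42)
crime_xy = (412_350.0, 1_445_870.0)  # hypothetical projected coordinates in metres

gx, gy = grid_mask(*crime_xy, cell_size=250)                      # "Grid 250m" style
mx, my = donut_geomask(*crime_xy, r_min=50, r_max=300, rng=rng)   # "Geomask 50-300m" style

print("grid-masked point:", (gx, gy))
print("geomask displacement (m):", round(math.hypot(mx - crime_xy[0], my - crime_xy[1]), 1))
```

Under these assumptions a grid mask moves a point by at most half the cell diagonal, while a geomask guarantees a minimum displacement, which is one reason the two families can distort ward assignment differently at similar nominal distances.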

This comparison is also relevant from the perspective of spatial units of analysis. A substantial literature on the modifiable areal unit problem shows that statistical relationships can change when phenomena are aggregated into different spatial units, especially when administrative boundaries do not align well with the behavioral processes under study (Fotheringham & Wong, 1991; Openshaw, 1984). In crime analysis specifically, variation across spatial units is often large enough to alter interpretation, which means that masking should be evaluated not only as a privacy intervention but also as a change to the spatial representation through which offender decision-making is measured (Steenbeek & Weisburd, 2016; Weisburd et al., 2012).

The familiarity term is especially sensitive to this issue. A same-ward indicator is convenient when prior offending is recorded by administrative unit, but it can also treat a boundary crossing as a substantive change in offender familiarity even when the masked event remains geographically close to the original location. A boundary-free formulation such as prior_crime_within_1km_any is therefore more robust to small positional shifts and more consistent with arguments in spatial criminology that offenders' awareness spaces and relevant opportunity structures do not necessarily coincide with administrative borders (Bernasco & Nieuwbeerta, 2005; Song et al., 2017).
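The contrast between the two familiarity measures can be shown with a toy example. The Python sketch below is illustrative only: the report does not state exactly which reference point the 1 km radius is measured from, so the target point, the ward layout, and both function names are assumptions.

```python
import math


def same_ward_prior(prior_wards, candidate_ward):
    """1 if any prior crime was recorded in the candidate ward, else 0."""
    return int(candidate_ward in prior_wards)


def prior_within_radius(prior_points, target_xy, radius_m=1000.0):
    """1 if any prior crime point lies within radius_m of the target point, else 0."""
    tx, ty = target_xy
    return int(any(math.hypot(px - tx, py - ty) <= radius_m for px, py in prior_points))


def ward_of(x, y):
    """Toy layout: wards A and B are separated by the vertical line x = 0 (metres)."""
    return "A" if x < 0 else "B"


target = (-200.0, 0.0)          # a location in ward A
prior_original = (-40.0, 0.0)   # prior crime in ward A, 160 m from the target
prior_masked = (60.0, 0.0)      # the same point after a 100 m shift, now in ward B

same_before = same_ward_prior({ward_of(*prior_original)}, ward_of(*target))
same_after = same_ward_prior({ward_of(*prior_masked)}, ward_of(*target))
near_before = prior_within_radius([prior_original], target)
near_after = prior_within_radius([prior_masked], target)

print("same-ward indicator:", same_before, "->", same_after)
print("within-1km indicator:", near_before, "->", near_after)
```

In this toy case a 100 m displacement flips the same-ward indicator from 1 to 0, while the distance-based indicator stays at 1 because the masked point is still only 260 m from the target.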

Table 1: Chennai real-data masking methods ranked against the unmasked baseline

| Method | Interpretation | RMSE vs baseline | Max abs. bias | Worst term | Crime point stays in same ward (%) | Home point stays in same ward (%) |
|---|---|---|---|---|---|---|
| Grid 250m | Best preservation of the unmasked baseline | 0.017 | 0.038 | Marriage halls (# 10) | 86.5 | 87.2 |
| Geomask 50-300m | Very close to the unmasked baseline | 0.028 | 0.075 | Mosques (# 10) | 79.0 | 79.3 |
| Grid 500m | Usable, but with some extra distortion | 0.040 | 0.124 | Any prior crime within 1 km (0,1) | 78.7 | 79.1 |
| Geomask 200-600m | More noticeable distortion | 0.084 | 0.336 | Any prior crime within 1 km (0,1) | 59.0 | 59.2 |
| Grid 1000m | More noticeable distortion | 0.091 | 0.224 | Mosques (# 10) | 58.1 | 62.6 |
| Geomask 400-1200m | More noticeable distortion | 0.114 | 0.496 | Any prior crime within 1 km (0,1) | 34.5 | 32.0 |

Grid 250m gives the closest agreement with the unmasked Chennai baseline (RMSE 0.017), followed by Geomask 50-300m (0.028) and Grid 500m (0.040). The wider masking settings produce more noticeable distortion (RMSE 0.084 to 0.114), but the main substantive pattern of the model remains recognizable.

2. Baseline vs best masking method

The next table compares the unmasked baseline directly against the best-performing masking method. It reports odds ratios, p-values, and the percentage change in the odds ratio. This is the most direct way to judge whether masking changes the substantive interpretation of the Chennai model under the boundary-free familiarity specification.

Table 2: Chennai baseline vs best masking method (Grid 250m)

| Term | Baseline OR | Baseline p | Best model | Best-model OR | Best-model p | OR difference | % change from baseline | Direction changed | Significance changed |
|---|---|---|---|---|---|---|---|---|---|
| Distance (km) | 0.356 | 0.000 | Grid 250m | 0.350 | 0.000 | -0.007 | -1.832 | FALSE | FALSE |
| Any prior crime within 1 km (0,1) | 10.018 | 0.000 | Grid 250m | 9.741 | 0.000 | -0.277 | -2.767 | FALSE | FALSE |
| Area (km2) | 1.075 | 0.000 | Grid 250m | 1.092 | 0.000 | 0.017 | 1.576 | FALSE | FALSE |
| Population (# 1000) | 1.004 | 0.208 | Grid 250m | 1.003 | 0.403 | -0.001 | -0.132 | FALSE | FALSE |
| Retail stores (# 10) | 1.010 | 0.358 | Grid 250m | 1.013 | 0.261 | 0.002 | 0.228 | FALSE | FALSE |
| Transit stations (# 10) | 1.039 | 0.432 | Grid 250m | 1.031 | 0.524 | -0.008 | -0.727 | FALSE | FALSE |
| Mosques (# 10) | 1.073 | 0.450 | Grid 250m | 1.093 | 0.341 | 0.020 | 1.859 | FALSE | FALSE |
| Temples (# 10) | 1.009 | 0.760 | Grid 250m | 0.980 | 0.494 | -0.029 | -2.916 | TRUE | FALSE |
| Churches (# 10) | 1.182 | 0.000 | Grid 250m | 1.161 | 0.000 | -0.021 | -1.757 | FALSE | FALSE |
| Education institutions (# 10) | 1.072 | 0.000 | Grid 250m | 1.066 | 0.000 | -0.005 | -0.507 | FALSE | FALSE |
| School and college (# 10) | 0.989 | 0.669 | Grid 250m | 0.994 | 0.814 | 0.005 | 0.495 | FALSE | FALSE |
| Personal care (# 10) | 1.066 | 0.003 | Grid 250m | 1.052 | 0.021 | -0.015 | -1.379 | FALSE | FALSE |
| Hospitals (# 10) | 0.999 | 0.950 | Grid 250m | 1.000 | 0.994 | 0.001 | 0.113 | TRUE | FALSE |
| Marriage halls (# 10) | 1.123 | 0.004 | Grid 250m | 1.168 | 0.000 | 0.044 | 3.921 | FALSE | FALSE |
| Jewelleries (# 10) | 1.025 | 0.163 | Grid 250m | 1.035 | 0.045 | 0.010 | 1.003 | FALSE | TRUE |
| Textiles (# 10) | 0.987 | 0.128 | Grid 250m | 0.984 | 0.058 | -0.003 | -0.331 | FALSE | FALSE |
| Park (# 10) | 1.176 | 0.017 | Grid 250m | 1.164 | 0.024 | -0.012 | -1.018 | FALSE | FALSE |
| Recreation facilities (# 10) | 0.957 | 0.124 | Grid 250m | 0.984 | 0.556 | 0.026 | 2.743 | FALSE | FALSE |
| Restaurant (# 10) | 1.028 | 0.033 | Grid 250m | 1.035 | 0.009 | 0.006 | 0.612 | FALSE | FALSE |
| Government office (# 10) | 1.058 | 0.013 | Grid 250m | 1.062 | 0.007 | 0.004 | 0.375 | FALSE | FALSE |

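The main findings remain stable under the best masking method. The distance effect remains negative and strong (OR 0.356 vs 0.350), and the prior_crime_within_1km_any effect remains strongly positive (OR 10.018 vs 9.741). Two terms are flagged as changing direction, Temples (OR 1.009 to 0.980) and Hospitals (OR 0.999 to 1.000), but both are non-significant and close to 1 in either model. One term, Jewelleries, is flagged as changing significance (p = 0.163 to p = 0.045). The core criminological interpretation therefore does not reverse under the best-performing masking specification.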
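The summary statistics in Table 1 can be checked against the percentage changes reported in Table 2. The Python sketch below assumes that each term's bias is the difference between the masked and baseline coefficients on the log-odds scale, which equals log(1 + % change / 100). This is an inference from the published numbers rather than a description of the R script; the variable names are illustrative.

```python
import math

# Percentage change in the odds ratio for Grid 250m, copied from Table 2.
pct_change = {
    "Distance (km)": -1.832,
    "Any prior crime within 1 km (0,1)": -2.767,
    "Area (km2)": 1.576,
    "Population (# 1000)": -0.132,
    "Retail stores (# 10)": 0.228,
    "Transit stations (# 10)": -0.727,
    "Mosques (# 10)": 1.859,
    "Temples (# 10)": -2.916,
    "Churches (# 10)": -1.757,
    "Education institutions (# 10)": -0.507,
    "School and college (# 10)": 0.495,
    "Personal care (# 10)": -1.379,
    "Hospitals (# 10)": 0.113,
    "Marriage halls (# 10)": 3.921,
    "Jewelleries (# 10)": 1.003,
    "Textiles (# 10)": -0.331,
    "Park (# 10)": -1.018,
    "Recreation facilities (# 10)": 2.743,
    "Restaurant (# 10)": 0.612,
    "Government office (# 10)": 0.375,
}

# log(OR_masked / OR_baseline) = log(1 + pct / 100): deviation on the coefficient scale.
bias = {term: math.log1p(p / 100.0) for term, p in pct_change.items()}

rmse = math.sqrt(sum(b * b for b in bias.values()) / len(bias))
worst_term = max(bias, key=lambda t: abs(bias[t]))
max_abs_bias = abs(bias[worst_term])

print(f"RMSE = {rmse:.3f}, max |bias| = {max_abs_bias:.3f} ({worst_term})")
```

Under this assumption the calculation gives an RMSE of about 0.017 and a maximum absolute bias of about 0.038 for Marriage halls, which agrees with the Grid 250m row of Table 1.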

3. Visual comparison of masking performance

The first figure compares odds ratios across the masking scenarios. The dashed line in each panel marks the unmasked baseline estimate. The second figure ranks the methods by overall coefficient deviation from the baseline.

Figure 1: Chennai real-data odds ratio comparison across masking methods. The dashed reference line in each panel marks the unmasked baseline estimate for that term.


Figure 2: Chennai real-data masking methods ranked by RMSE of coefficient deviation from the unmasked baseline. Lower values indicate better preservation of the original model.


4. Interpretation

For the Chennai real-data application, the practical question is not which masking method is universally best, but which method preserves the original analytical result most closely. In this run, the smaller masking settings perform best. Grid 250m produces the lowest overall deviation from the unmasked model, with Geomask 50-300m also remaining close. The wider geomasking and coarser grid settings introduce more noticeable coefficient drift, especially for the familiarity term when points cross wards more often.

This should therefore be read as a baseline-deviation study. The goal is to test how far the masked model moves away from the original fitted Chennai result once sensitive crime and home locations are transformed. Because the benchmark is the unmasked model rather than true population parameters, the interpretation is straightforward: smaller RMSE means better analytical preservation.

5. Chennai Part I output files

The Chennai analysis itself is run outside this report. The table below lists the output files currently consumed by Part I.

Table 3: Chennai Part I output files loaded by this report

| File | Folder | Size (KB) | Description |
|---|---|---|---|
| Baseline Coefficients | prior within 1km | 1.4 | Baseline coefficient estimates from the unmasked Chennai model |
| Baseline Compare Table | prior within 1km | 1.0 | Paper-style baseline odds-ratio comparison file |
| Baseline Odds Ratios | prior within 1km | 1.8 | Baseline odds ratios with confidence intervals |
| Baseline Vs Best Model | prior within 1km | 2.1 | Reader-friendly baseline vs best-method comparison |
| Bias Table | prior within 1km | 11.5 | Coefficient-level bias table across all masking methods |
| Coefficient Comparison | prior within 1km | 462.0 | Odds-ratio comparison plot across masking methods |
| Deviation Table | prior within 1km | 14.3 | Detailed deviation of each coefficient from baseline |
| Easy Method Summary Table | prior within 1km | 1.0 | Plain-language method ranking summary |
| Model Coefficients | prior within 1km | 9.0 | All masked and baseline coefficient estimates |
| Model Odds Ratios | prior within 1km | 11.4 | All masked and baseline odds ratios |
| RMSE Bias Comparison | prior within 1km | 73.3 | RMSE comparison plot across masking methods |
| RMSE Summary | prior within 1km | 0.4 | Method ranking by RMSE and max absolute bias |
| Ward Shift Summary | prior within 1km | 0.2 | Ward stability summary after masking |

6. Appendix Note: Reduced-Specification Robustness Check

As a robustness check, I also estimated a reduced specification that drops the prior-location term entirely and re-runs the masking comparison on that reduced model. This is not used as the main specification, because it removes an important spatial familiarity mechanism. It is retained as an appendix-style check to show that the main conclusions do not depend entirely on the 1 km familiarity adaptation.

Appendix Table A1: Reduced-specification robustness check

| Robustness specification | Best masking method | Best RMSE | Crime point stays in same ward (%) | Home point stays in same ward (%) | Note |
|---|---|---|---|---|---|
| Reduced model (drop prior-crime term) | Geomask 50-300m | 0.015 | 79.8 | 80 | Used as appendix robustness check because it drops the prior-location mechanism. |

In that reduced model, the best masking method was Geomask 50-300m with an RMSE of about 0.015, which is slightly more stable numerically than the main 1 km familiarity model. However, because it drops the prior-location mechanism altogether, it is treated as a robustness check rather than the preferred substantive specification.


Part I is based on the output set in outputs/chennai_masking_alternatives/prior_within_1km/. The dropped-prior specification in outputs/chennai_masking_alternatives/reduced_no_prior/ is included only as an appendix robustness check.


PART II: Pseudonymization, PV-Number Consistency & Cross-Dataset Linking

This section demonstrates how the anonymization algorithm handles personal identifiers -- PV numbers, RRN (Rijksregisternummer), and person attributes -- so that multiple police datasets can still be linked by researchers after anonymization, without exposing identity.


Cross-zone standardisation - a core motivation for algorithmic anonymization: Belgian police data is produced by approximately 187 local police zones, each with its own database infrastructure and extraction workflow. Without an algorithmic approach, zones anonymize data inconsistently: one zone may redact birth year while another retains it; one may suppress nationality while another codes it differently. Aggregated research across zones then carries systematic measurement error that is invisible to the researcher. The consistent keyed pseudonymization demonstrated in this section addresses the identifier side of this problem: provided that all zones use the same key and the same identifier formatting, HMAC(key, RRN) produces the same pseudonym for the same person regardless of which zone extracted the record. Cross-zone identity linkage then becomes reliable and cross-zone statistical comparisons rest on consistently coded identifiers, with no personal data exchange between zones required before pseudonymization.


Publicly documented Belgian criminal-justice data environments extend beyond a single police register. They include local police operational systems such as ISLP/ISLP2, mobile and search layers such as PoliceSearch, national police reference systems centred on the ANG/BNG and the publicly referenced FEEDIS feed environment, justice systems such as JustCase and JustMask, prison systems such as Sidis Suite, and NICC research or forensic infrastructures such as DOT and be.care. The exact internal schemas of these systems are not publicly documented in full, but the linkage problem is clear: the same person, case, or event may reappear across operational silos under different local formatting conventions and in different data modalities.

These environments combine structured operational records, free-text narratives and legal documents, and temporal and geospatial event data. For that reason, the anonymization problem addressed here is not limited to removing direct identifiers from one table. A practical pipeline must standardize identifiers before transformation, apply deterministic pseudonymization to person and case keys, handle text redaction as a separate task, and protect spatial or temporal fields separately where those fields create re-identification risk. The synthetic ISLP/TAS/SRS2-style inputs used below are therefore best understood as a simplified stand-in for a broader multi-system linkage problem rather than as a claim to reproduce every operational database in full.


Core challenge in Belgian police data: Police information is stored across multiple systems such as local operational police registers, offender and victim files, national police reference systems, judicial case-management environments, prison systems, and forensic research infrastructures. Each record may carry a PV number (Proces-Verbaal) or another case identifier, together with person identifiers that allow the same individual to be linked across events and institutional contexts.

When data are prepared for research, removing names is not sufficient. PV numbers, national registry numbers (Rijksregisternummer/RRN), and date of birth can still support re-identification when they are combined.

The method demonstrated here uses consistent keyed pseudonymization so that:

  • The same person in Dataset A and Dataset B gets the same anonymized ID
  • Researchers can still link records across time and datasets
  • Pseudonyms can be traced back to real identifiers only with the secret key or the protected mapping table, both held by the data controller

1. Simulated Raw Police Data - Modelled on Real Belgian Police Records (Three Linked Datasets)

The raw synthetic police-style inputs used below were generated once by scripts/generate_synthetic_demo_data.R and are loaded from data_generated/. This separation was used to keep the report focused on anonymization, pseudonymization, linkage, and output interpretation rather than on repeated data creation during each render.

1.1 Helper: generate realistic Belgian identifiers

1.2 Dataset 1 - Crime Incidents (Feiten Register)

The table below shows the structure of the raw synthetic incident register before pseudonymization. The key analytical fields are the PV number, date, offence type, neighbourhood, and injury/outcome variables.

Table 8: Preview of the raw synthetic incident register (first 4 records). All data are fully synthetic.

| pv_number | datum | tijd | delict_type | wijk | letsel | status |
|---|---|---|---|---|---|---|
| 2022/GNT/05041 | 2023-08-04 | 10:15 | Property crime | Wondelgem | Geen | Gesloten |
| 2024/GNT/03266 | 2022-02-18 | 19:00 | Public order / substance | Sint-Amandsberg | Geen | Gesloten |
| 2022/GNT/05326 | 2022-03-12 | 23:30 | Property crime | Mariakerke | Geen | Gesloten |
| 2022/GNT/09504 | 2024-07-07 | 19:45 | Public order / substance | Mariakerke | Geen | Doorverwezen |

1.3 Dataset 2 - Offender Records (Dader Register)

Some offenders committed multiple crimes (repeat offenders - same person_id in multiple rows).

The next table highlights why deterministic pseudonymization matters: some individuals appear in multiple PV records, so the research version must preserve within-person linkage without exposing identity.

1.4 Dataset 3 - Victim Records (Slachtoffer Register)

Some victims appear in multiple incidents. Critically, some persons are BOTH offender and victim in different cases - a real pattern in interpersonal violence data.


2. The Problem: Naive Anonymization Breaks Linkage

A naive approach strips direct identifiers (name, RRN, address) and replaces each identifier with a new random ID every time. This is the most common mistake in ad-hoc anonymization.

Why this fails: Random replacement generates a different code for 2022/GNT/00341 in the offender file and a completely different code in the victim file. Researchers cannot join on PV number - the data becomes useless for cross-file analysis.
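The failure can be reproduced in a few lines. The following Python sketch is a toy illustration: apart from 2022/GNT/00341, the PV numbers are hypothetical, and the random-code format is an arbitrary choice.

```python
import uuid

# Two files that share one case: 2022/GNT/00341 appears in both.
offender_pvs = ["2022/GNT/00341", "2022/GNT/00777"]
victim_pvs = ["2022/GNT/00341", "2023/GNT/01234"]


def naive_random_ids(values):
    """Replace every value with a fresh random code, independently per file."""
    return [uuid.uuid4().hex[:12].upper() for _ in values]


raw_shared = set(offender_pvs) & set(victim_pvs)
masked_shared = set(naive_random_ids(offender_pvs)) & set(naive_random_ids(victim_pvs))

print("shared PV numbers before:", len(raw_shared))
print("shared PV codes after naive replacement:", len(masked_shared))
```

Because each file draws its codes independently, the one case that the two raw files have in common no longer matches after replacement.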


3. The Solution: Consistent Keyed Pseudonymization

We use HMAC-SHA256, which can be understood here as a standard keyed hashing method. In practice, it turns an identifier such as an RRN or PV number into a stable pseudonym using a secret key. The same input and the same key always produce the same pseudonym; without the key, the original identifier cannot be read back directly from the output.

This is the approach recommended by ENISA (ENISA, 2019) and compatible with GDPR Art. 4(5) pseudonymization.
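A minimal Python sketch of this idea, using only the standard library, is given below. It is not the report's own implementation: the normalisation rule, the 12-character truncation (chosen to resemble the pseudonyms shown in the previews), the prefixes, and the demonstration key are all assumptions made for illustration.

```python
import hashlib
import hmac


def normalise(identifier: str) -> str:
    """Strip separators and case so that formatting variants hash identically."""
    return "".join(ch for ch in identifier.upper() if ch.isalnum())


def pseudonymise(identifier: str, key: bytes, prefix: str, length: int = 12) -> str:
    """Return prefix plus the first `length` hex characters of HMAC-SHA256(key, identifier)."""
    digest = hmac.new(key, normalise(identifier).encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:length].upper()}"


SECRET_KEY = b"demo-key-never-use-in-production"  # hypothetical; a real key belongs in a vault

a = pseudonymise("86.04.12-234.71", SECRET_KEY, "PRS")   # synthetic RRN, formatted
b = pseudonymise("860412 234 71", SECRET_KEY, "PRS")     # same RRN, different formatting
c = pseudonymise("86.04.12-234.71", b"another-key", "PRS")

print(a == b)  # same person, same key: identical pseudonym
print(a == c)  # same person, different key: different pseudonym
```

Two design points follow from the sketch. Normalising before hashing matters because HMAC treats differently formatted copies of the same identifier as different inputs. Truncating the digest keeps pseudonyms readable but raises the chance of accidental collisions as the number of distinct identifiers grows, so a production system may prefer to retain more of the digest.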

3.1 Apply consistent pseudonymization to all three datasets

3.2 Preview: what the pseudonymized data looks like

The researcher-facing files shown below retain analytical fields and stable pseudonyms, but remove direct identifiers such as names, RRN, and raw PV numbers.

Table 9: Preview of the pseudonymized offender file (first 5 records)

| pv_pseudo | person_pseudo | person_pseudo_univ | geslacht | nationaliteit | burgelijke_staat | leeftijdsgroep | rol | gekend_bij_pz |
|---|---|---|---|---|---|---|---|---|
| PV-841A5F319D8B | D-F23B9BA25B6E | PRS-F23B9BA25B6E | V | Duits | Samenwonend | | Verdachte | Nee |
| PV-85A517390E8F | D-523BA24A1825 | PRS-523BA24A1825 | M | Congolees | Samenwonend | | Verdachte | Ja |
| PV-7E79B32DE9FE | D-1895C0DB1C22 | PRS-1895C0DB1C22 | M | Roemeens | Samenwonend | | Verdachte | Nee |
| PV-01FB2838F77F | D-80C0220A9292 | PRS-80C0220A9292 | V | Nederlands | Gescheiden | | Verdachte | Nee |
| PV-CE1B202E69F3 | D-AC4CC429F7F3 | PRS-AC4CC429F7F3 | V | Belgisch | Gehuwd | | Verdachte | Ja |

Table 10: Preview of the pseudonymized victim file (first 5 records)

| pv_pseudo | person_pseudo | person_pseudo_univ | geslacht | nationaliteit | burgelijke_staat | leeftijdsgroep | rol | relatie_dader |
|---|---|---|---|---|---|---|---|---|
| PV-BDD76BBF3F1C | S-F71BABA98A2E | PRS-F71BABA98A2E | M | Duits | Weduwe/Weduwnaar | | Slachtoffer | Kennis |
| PV-C19BF58E43D7 | S-ACF088B1C241 | PRS-ACF088B1C241 | M | Pools | Gehuwd | | Slachtoffer | Onbekend |
| PV-661E861336D7 | S-AB252BC3C6E5 | PRS-AB252BC3C6E5 | M | Frans | Gehuwd | | Slachtoffer | Collega |
| PV-B87EB997C714 | S-7DF963006DA7 | PRS-7DF963006DA7 | M | Duits | Ongehuwd | | Slachtoffer | Familielid |
| PV-7053CD96B401 | S-3D6FEB2A6CA2 | PRS-3D6FEB2A6CA2 | M | Duits | Gescheiden | | Slachtoffer | Kennis |


4. Demonstrate: Cross-Dataset Linking Still Works

The core claim of consistent keyed pseudonymization is that all analytical joins work identically before and after anonymization. The scorecard below verifies this explicitly.

Linkage integrity scorecard: the intended analytical joins are preserved after pseudonymization in this synthetic demonstration

| Join type | Raw data | After pseudonymization | Preserved? |
|---|---|---|---|
| Incidents row count | 120 | 120 | ✓ 100% |
| Unique PV numbers (incidents) | 120 | 120 | ✓ 100% |
| Unique offender persons (by RRN/pseudonym) | 42 | 42 | ✓ 100% |
| Offender → Incident joins (PV number) | 85 | 85 | ✓ 100% |
| Offender → Victim joins (same person) | 4 | 4 | ✓ 100% |

How to read this: The Raw data column counts joins using real identifiers (RRN, PV number). The After pseudonymization column counts the same joins using pseudonyms. Identical counts show that, in this synthetic example and with a consistent key, the intended links are preserved after direct identifiers are transformed. This does not mean the release is risk-free: quasi-identifier risk remains and is evaluated in the next sections.
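The logic of such a scorecard can be sketched as follows. This Python example uses tiny hypothetical files and a hypothetical key; it only illustrates the principle of counting the same joins on raw identifiers and on pseudonyms, and it does not reproduce the counts reported above.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical key, for illustration only


def pseudo(value: str, prefix: str) -> str:
    """Deterministic keyed pseudonym: prefix plus 12 hex characters of HMAC-SHA256."""
    digest = hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:12].upper()}"


# Tiny hypothetical files: (pv_number, person identifier) pairs.
offenders = [("2022/GNT/00341", "R1"), ("2022/GNT/00777", "R2"), ("2023/GNT/01234", "R1")]
victims = [("2022/GNT/00777", "R3"), ("2023/GNT/09999", "R2")]
incident_pvs = {"2022/GNT/00341", "2022/GNT/00777", "2023/GNT/01234", "2023/GNT/09999"}


def scorecard(offenders, victims, incident_pvs):
    """Count the joins that are compared before and after pseudonymization."""
    return {
        "unique_offenders": len({person for _, person in offenders}),
        "offender_incident_joins": sum(pv in incident_pvs for pv, _ in offenders),
        "offender_and_victim": len({p for _, p in offenders} & {p for _, p in victims}),
    }


raw = scorecard(offenders, victims, incident_pvs)
masked = scorecard(
    [(pseudo(pv, "PV"), pseudo(p, "PRS")) for pv, p in offenders],
    [(pseudo(pv, "PV"), pseudo(p, "PRS")) for pv, p in victims],
    {pseudo(pv, "PV") for pv in incident_pvs},
)

print(raw)
print(masked)
```

Because the mapping from identifier to pseudonym is deterministic under a fixed key, every count computed on pseudonyms matches the count computed on raw identifiers, barring a collision between truncated pseudonyms.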

4.2 Find the same person across offender AND victim datasets

This is the most sensitive use case: a person who was an offender in one case and a victim in another. Using the universal person pseudonym (person_pseudo_univ), researchers can trace this without ever knowing who the person is.

Table 12: Persons with BOTH offender and victim roles, linked by universal pseudonym without knowing their identity

| Person pseudonym | Offender PV | Role (offender) | Victim PV | Role (victim) | Relation to suspect |
|---|---|---|---|---|---|
| PRS-64439954BA3B | PV-8CAE09F42E9D | Verdachte | PV-ED87DE7C95D7 | Slachtoffer | Onbekend |
| PRS-D2811785D27F | PV-EF8DB40B5816 | Verdachte | PV-24464F55C289 | Slachtoffer | Onbekend |
| PRS-D2811785D27F | PV-EF8DB40B5816 | Verdachte | PV-23FFB7E4AD3D | Slachtoffer | Onbekend |
| PRS-1E4DC75EB6FF | PV-6DFFC4C79251 | Verdachte | PV-21358D6B89AD | Slachtoffer | Partner |

4.3 Track a repeat offender across multiple PV numbers

Table 13: Criminal history of offender PRS-12DC66B62F83 (identity unknown to researcher)

| Person pseudonym | PV pseudonym | Date | Offence type | Neighbourhood | Gender | Age group | Nationality | Prior PVs |
|---|---|---|---|---|---|---|---|---|
| PRS-12DC66B62F83 | PV-795D706A47D6 | 2021-05-23 | Violence / interpersonal | Muide | M | | Belgisch | 0 |
| PRS-12DC66B62F83 | PV-661E861336D7 | 2021-08-01 | Violence / interpersonal | Sint-Amandsberg | M | | Belgisch | 3 |
| PRS-12DC66B62F83 | PV-CC570D1B8B80 | 2021-08-02 | Violence / interpersonal | Gentbrugge | M | | Belgisch | 3 |
| PRS-12DC66B62F83 | PV-E871AF0588D2 | 2023-08-19 | Violence / interpersonal | Mariakerke | M | | Belgisch | 1 |
| PRS-12DC66B62F83 | PV-7B70B46C910C | 2023-09-02 | Property crime | Ledeberg | M | | Belgisch | 1 |
| PRS-12DC66B62F83 | PV-A9921AD699D6 | 2023-11-13 | Public order / substance | Gentbrugge | M | | Belgisch | 3 |
| PRS-12DC66B62F83 | PV-9A654E2DCB93 | 2024-04-04 | Property crime | Gentbrugge | M | | Belgisch | 2 |


5. Re-identification Risk Analysis

Even after direct identifiers are removed, a combination of quasi-identifiers (here age group, nationality, marital status, and gender) can make an individual unique and therefore re-identifiable. Sweeney (2002) reports the well-known estimate that 87% of the U.S. population could be uniquely identified by 5-digit ZIP code, date of birth, and sex.
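One simple way to quantify this risk is to count how many records fall into unique or small quasi-identifier groups. The Python sketch below is illustrative: the column names follow the pseudonymized file previews, the age-group labels and records are hypothetical, and the group-size threshold of 5 is an assumption. It does not claim to reproduce the exact definitions behind the risk tables in this section.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("leeftijdsgroep", "nationaliteit", "burgelijke_staat", "geslacht")


def risk_summary(records, quasi_identifiers=QUASI_IDENTIFIERS, k=5):
    """Summarise how many records sit in unique or small quasi-identifier groups."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique_combos = sum(1 for n in combos.values() if n == 1)
    small_group_records = sum(n for n in combos.values() if n < k)
    return {
        "records": len(records),
        "distinct_combos": len(combos),
        "unique_combos": unique_combos,
        "small_group_records": small_group_records,
        "pct_small_group": round(100 * small_group_records / len(records), 1),
    }


# Hypothetical records: one large group, one pair, and one unique individual.
records = (
    [{"leeftijdsgroep": "25-34", "nationaliteit": "Belgisch",
      "burgelijke_staat": "Ongehuwd", "geslacht": "M"}] * 6
    + [{"leeftijdsgroep": "35-44", "nationaliteit": "Pools",
        "burgelijke_staat": "Gehuwd", "geslacht": "V"}] * 2
    + [{"leeftijdsgroep": "18-24", "nationaliteit": "Duits",
        "burgelijke_staat": "Samenwonend", "geslacht": "V"}]
)

summary = risk_summary(records)
print(summary)
```

In this toy file three of nine records belong to groups smaller than five, and one combination is unique, so pseudonymizing the identifiers alone would leave those records exposed.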

5.1 Uniqueness by quasi-identifier combination

Table 14: Re-identification risk in the raw offender and victim files

| Dataset | Records | Unique combos | % unique | Small group (n) | % small group |
|---|---|---|---|---|---|
| Offenders (raw) | 85 | 9 | 30% | 39 | 46% |
| Victims (raw) | 110 | 9 | 21% | 63 | 57% |

Table 15: Re-identification risk after pseudonymization

| Dataset | Records | Unique combos | % unique | Small group (n) | % small group |
|---|---|---|---|---|---|
| Offenders (pseudonymized) | 85 | 9 | 30% | 39 | 46% |
| Victims (pseudonymized) | 110 | 9 | 21% | 63 | 57% |

5.2 k-Anonymity check per nationality/age group cell

These two summary tables show that pseudonymization removes direct identifiers, but does not on its own eliminate all uniqueness risk in quasi-identifier combinations. That is why the k-anonymity check below is still needed.

## No rows in ka_check - skipping plot.

5.3 Apply k-anonymity suppression (k >= 5 threshold)

## Offenders - nationality-generalised records: 55 of 85 
## Offenders - age-group-suppressed records:   0 of 85
## Victims - nationality-generalised records: 83 of 110 
## Victims - age-group-suppressed records:   0 of 110
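A generalisation step of this kind can be sketched in Python as follows. The function name, the fallback label, the age-group label, and the toy records are hypothetical; the sketch assumes that nationality is replaced by a broader category whenever a nationality by age-group cell contains fewer than k records.

```python
from collections import Counter


def generalise_small_cells(records, column, cell_columns, k=5, fallback="Other"):
    """Replace `column` with `fallback` in records whose cell has fewer than k members."""
    cells = Counter(tuple(r[c] for c in cell_columns) for r in records)
    released = []
    for record in records:
        record = dict(record)  # copy so the input records stay untouched
        if cells[tuple(record[c] for c in cell_columns)] < k:
            record[column] = fallback
        released.append(record)
    return released


# Hypothetical input: one cell of five records and one cell of two.
records = (
    [{"nationaliteit": "Belgisch", "leeftijdsgroep": "25-34"}] * 5
    + [{"nationaliteit": "Pools", "leeftijdsgroep": "25-34"}] * 2
)

released = generalise_small_cells(records, "nationaliteit", ("nationaliteit", "leeftijdsgroep"))
n_generalised = sum(r["nationaliteit"] == "Other" for r in released)

print(f"nationality-generalised records: {n_generalised} of {len(released)}")
```

A single pass is not always enough: the generalised category can itself end up in a cell smaller than k, so the check should be rerun on the released file and followed by suppression where needed.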

6. End-to-End Research Workflow

This is the complete output that a researcher receives at the end of the pipeline. Researchers have no access to real identifiers, only to pseudonymous IDs and generalised attributes, yet they can still perform longitudinal and cross-dataset analysis.

Figure 11: Offence type by age group derived from the pseudonymized researcher dataset (Part II). This chart demonstrates that cross-variable analysis remains fully possible: offender age groups (from pseudonymized records) are linked to incident crime types via the PV pseudonym, without any direct identifier being present in the data.


Figure 12: Crime type by neighbourhood heatmap derived from the pseudonymized researcher dataset (Part II). Full analytical detail is preserved: no direct identifiers remain, yet cross-variable tabulation across neighbourhoods and offence types is unimpeded.



7. Additional Release Checks

This section keeps only the additional release checks that are not already visible in the linkage tables above. The core point is straightforward: deterministic pseudonymization preserves joins, but a research release still needs decisions about timestamp precision and free-text redaction before it can be shared safely.

7.1 Temporal precision

Exact timestamps can become identifying when they are combined with offence category, neighbourhood, or person-level attributes. For that reason, the practical question is not whether time should be dropped entirely, but how far it should be coarsened before release.
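The effect of coarsening can be measured by counting distinct released values at each precision level. The Python sketch below is illustrative only: the four time-band boundaries, the level names, and the example timestamps are assumptions, and the uniqueness measure is taken to be the number of distinct values divided by the number of records.

```python
from collections import Counter
from datetime import datetime


def time_band(hour: int) -> str:
    """Map an hour to one of four bands (the band boundaries are an assumption)."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


def coarsen(ts: datetime, level: str):
    """Return the released value of a timestamp at a given precision level."""
    if level == "exact":
        return ts.strftime("%Y-%m-%d %H:%M")
    if level == "date_hour":
        return ts.strftime("%Y-%m-%d %H")
    if level == "date_band":
        return (ts.strftime("%Y-%m-%d"), time_band(ts.hour))
    if level == "week_band":
        iso = ts.isocalendar()
        return (iso[0], iso[1], time_band(ts.hour))
    if level == "month_band":
        return (ts.strftime("%Y-%m"), time_band(ts.hour))
    raise ValueError(f"unknown level: {level}")


def pct_distinct(timestamps, level):
    """Distinct released values as a percentage of all records."""
    counts = Counter(coarsen(ts, level) for ts in timestamps)
    return 100 * len(counts) / len(timestamps)


stamps = [
    datetime(2022, 3, 12, 23, 30),
    datetime(2022, 3, 12, 23, 45),
    datetime(2022, 3, 14, 19, 0),
    datetime(2022, 3, 28, 20, 15),
]

for level in ("exact", "date_hour", "date_band", "week_band", "month_band"):
    print(level, pct_distinct(stamps, level))
```

In a small file most timestamps stay distinct until the precision is reduced considerably, which is why the choice of level should be checked against the actual release rather than assumed.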

Table 16: Temporal precision vs. re-identification risk. In this 120-record synthetic file, coarsening lowers the share of unique combinations only gradually, from 99.2% at exact precision to 72.5% at month plus time band, and every level is still rated High.

| Precision level | Unique combinations | Total records | % unique | Re-id risk |
|---|---|---|---|---|
| Exact (date + HH:MM) | 119 | 120 | 99.2 | High |
| Date + hour of day (24) | 118 | 120 | 98.3 | High |
| Date + time band (4) | 117 | 120 | 97.5 | High |
| Week + time band | 108 | 120 | 90.0 | High |
| Month + time band | 87 | 120 | 72.5 | High |


7.2 Narrative fields

Structured pseudonymization is not enough when police records include narrative text. Names, RRNs, addresses, and times often remain embedded in natural language fields and therefore require a separate redaction step before a research extract can be released.

Table 17: Free-text narrative anonymization, raw police narrative (left) vs. redacted output (right). In production, Dynizer's neuro-symbolic AI performs entity identification with substantially higher recall than the regex baseline shown here.

| Record ID | Raw narrative | Redacted narrative |
|---|---|---|
| PV-A1B2 | Op 15/03/2022 om 23:45 werd Jan De Smedt, RRN 86.04.12-234.71, wonende Langestraat 42, 9000 Gent, aangetroffen. | Op [DATUM] om [TIJD] werd [NAAM], RRN [RRN], wonende [ADRES], [POSTCODE], aangetroffen. |
| PV-C3D4 | Het voertuig werd bestuurd door Mohamed El Amrani, geboortedatum 04/04/1990, RRN 90.04.04-123.45, uit Merelbeke. | Het voertuig werd bestuurd door [NAAM], geboortedatum [DATUM], RRN [RRN], uit Merelbeke. |
| PV-E5F6 | Slachtoffer Emma Vandenberghe (geb. 12-06-1978), verblijvend te Veldstraat 18, Gent, deed aangifte van diefstal. | [NAAM] (geb. [DATUM]), verblijvend te [ADRES], Gent, deed aangifte van diefstal. |
| PV-G7H8 | Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om 02:30. | Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om [TIJD]. |

Interpretation: The regex example is only a baseline used to illustrate the transformation logic. In practice, entity detection in free text needs a stronger NLP layer because names, addresses, and times appear in many formats that simple patterns do not capture reliably.
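A pattern-based baseline of this kind might look like the Python sketch below. It covers only the three entity types that have a fixed surface format in the examples (RRN, date, time); the patterns are assumptions written for this illustration, and names, street addresses, and postcodes are deliberately left out because simple patterns do not capture them reliably.

```python
import re

# Order matters: the RRN rule runs first so its digits are not touched by later rules.
PATTERNS = [
    ("[RRN]", re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b")),
    ("[DATUM]", re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b")),
    ("[TIJD]", re.compile(r"\b\d{2}:\d{2}\b")),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


narrative = "Het voertuig werd bestuurd door X, geboortedatum 04/04/1990, RRN 90.04.04-123.45, om 02:30."
print(redact(narrative))
```

Anything that does not match one of the fixed formats passes through unchanged, which is the limitation that motivates a stronger entity-detection layer.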


8. GDPR Compliance, Project Challenges & Research Approach

8.2 Constraints Demonstrated in This Report

The report does not demonstrate every operational challenge of a full police-data release system, but it does show three constraints that remain central in practice. First, spatial masking does not affect every criminological analysis in the same way: in Part I the logdistance and prior_crime_within_1km_any terms are more sensitive to point displacement than several ward-level opportunity covariates, so masking has to be validated against the target analysis rather than chosen abstractly. Second, linked files remain usable only when pseudonymization is deterministic, because joins fail if key use or identifier formatting varies across extracts. Third, even after direct identifiers are removed, linked researcher files can still contain rare quasi-identifier combinations, which is why suppression and coarsening remain necessary before release.


8.3 How My Background Is Relevant to These Constraints

The relevance of my background to this project lies less in claiming a complete production solution and more in bringing together the parts already demonstrated here: spatial criminological analysis, reproducible implementation, and structured disclosure-control thinking.

Spatial criminology, scale, and model sensitivity Work on crime location choice and related spatial criminological questions is directly relevant to the first constraint above. It also includes attention to how changing spatial scale affects criminological interpretation. That matters because anonymization decisions should be tied to analytical consequences, not only to abstract privacy principles.

Linked-data handling and pseudonymization logic Experience with crime data and familiarity with police-style file structures are relevant to the second constraint. A linked release is only useful when identifiers are handled consistently across files and over time. The pseudonymization section of this report was built to show exactly that issue: the technical task is not only to hide names, but to preserve the joins researchers actually need.

Reproducible implementation and risk auditing: Work in R, sf, simulation, and reproducible workflows is relevant to the third constraint because quasi-identifier risk is not something that should be checked informally. It has to be implemented, measured, documented, and rerun when release conditions change. That is the part of the project where statistical reasoning and implementation work meet most clearly.
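
A rerunnable quasi-identifier audit of this kind can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R code behind this report. The field names (age_band, sex, zone), the toy records, and the threshold k = 2 are hypothetical, and dropping risky records is only one possible response (coarsening categories is another).

```python
from collections import Counter


def k_anonymity_audit(records, quasi_identifiers, k):
    """Return (smallest class size, sorted QI combinations held by fewer than k records)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    risky = sorted(combo for combo, n in counts.items() if n < k)
    return min(counts.values(), default=0), risky


def drop_risky(records, quasi_identifiers, risky):
    """Record suppression: remove every record whose QI combination is risky."""
    risky_set = set(risky)
    return [r for r in records if tuple(r[q] for q in quasi_identifiers) not in risky_set]


# Synthetic toy file: three classes of size 3, 2, and 1.
QI = ["age_band", "sex", "zone"]
records = (
    [{"age_band": "20-29", "sex": "M", "zone": "A"}] * 3
    + [{"age_band": "30-39", "sex": "F", "zone": "B"}] * 2
    + [{"age_band": "70-79", "sex": "F", "zone": "A"}]
)

min_before, risky = k_anonymity_audit(records, QI, k=2)
released = drop_risky(records, QI, risky)
min_after, risky_after = k_anonymity_audit(released, QI, k=2)

print(min_before, risky)       # 1 [('70-79', 'F', 'A')]
print(min_after, risky_after)  # 2 []
```

Because the audit is a function of the file, the quasi-identifier list, and k, it can be rerun unchanged whenever any of the three changes.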


Security note on key mapping table: A mapping between real identifiers (RRN, PV numbers) and their pseudonyms has been written to outputs/Part_II_Pseudonymization/secure_vault/. This file is for police system use only. It must never be shared with researchers or stored in the research environment. In production, this table would be held in the data controller's HSM or certified key vault, in line with the security-of-processing requirements of GDPR Art. 32.


All names, RRNs, and case details in this section are fully synthetic. No real persons are represented.


References

Academic sources

Bernasco, W., & Nieuwbeerta, P. (2005). How do residential burglars select target areas? A new approach to the analysis of criminal location choice. British Journal of Criminology, 45(3), 296-315. https://doi.org/10.1093/bjc/azh070

European Union Agency for Cybersecurity. (2019). Pseudonymisation techniques and best practices: Recommendations on shaping technology according to data protection and privacy provisions. Publications Office of the European Union.

Fotheringham, A. S., & Wong, D. W. S. (1991). The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A, 23(7), 1025-1044. https://doi.org/10.1068/a231025

Kuralarasan, K., & Bernasco, W. (2022). Location choice of snatching offenders in Chennai City. Journal of Quantitative Criminology, 38, 673-696. https://doi.org/10.1007/s10940-021-09514-9

Openshaw, S. (1984). The modifiable areal unit problem. Geo Books.

Song, G., Bernasco, W., Liu, L., Xiao, L., Zhou, S., & Liao, W. (2017). Crime feeds on legal activities: Daily mobility flows help to explain thieves' target location choices. Journal of Quantitative Criminology, 33(4), 831-854. https://doi.org/10.1007/s10940-016-9326-0

Steenbeek, W., & Weisburd, D. (2016). Where the action is in crime? An examination of variability of crime across different spatial units in The Hague, 2001-2009. Journal of Quantitative Criminology, 32(3), 449-469. https://doi.org/10.1007/s10940-015-9276-3

Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570. https://doi.org/10.1142/S0218488502001648

Vandeviver, C., Van Daele, S., & Vander Beken, T. (2015). What makes long crime trips worth undertaking? Balancing costs and benefits in burglars' journey to crime decisions. British Journal of Criminology, 55(2), 399-420. https://doi.org/10.1093/bjc/azu093

Weisburd, D., Groff, E. R., & Yang, S.-M. (2012). The criminology of place: Street segments and our understanding of the crime problem. Oxford University Press.

Legal sources

Belgian Federal Government. (1992). Wet op het Politieambt [Belgian Police Act]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/

Belgian Federal Government. (2018). Wet van 30 juli 2018 betreffende de bescherming van natuurlijke personen met betrekking tot de verwerking van persoonsgegevens [Law of 30 July 2018 on the protection of natural persons with regard to the processing of personal data]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/

European Parliament and Council of the European Union. (2016a). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88. https://eur-lex.europa.eu/eli/reg/2016/679/oj

European Parliament and Council of the European Union. (2016b). Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data. Official Journal of the European Union, L 119, 89-131. https://eur-lex.europa.eu/eli/dir/2016/680/oj