Scope of the report
In criminological research using crime datasets and related administrative records, two disclosure problems usually have to be addressed separately: sensitive locations and identifiers that support linkage. This document combines two linked demonstrations that address each in turn:
Part I addresses spatial masking and analytical distortion. Part II addresses identifier protection and cross-file linkage.
Part I presents an empirical Chennai application
based on the output set in
outputs/chennai_masking_alternatives/prior_within_1km/. The
main specification operationalizes prior spatial familiarity with the
boundary-free term prior_crime_within_1km_any and compares
masked estimates against the unmasked baseline of that same
specification. A reduced specification that drops the prior-location
term entirely is retained as an appendix robustness check.
Part II (Sections 1--8 of the pseudonymization chapter) presents the identifier-management side of the same pipeline. Three linked synthetic administrative datasets -- incident register, offender records, victim records -- were structured as Belgian police-style linked files, drawing on systems such as ISLP and related police databases. Direct identifiers such as name, RRN, address, and PV number were replaced using keyed pseudonymization. The section then shows whether the intended joins remain valid, whether quasi-identifier risk can be detected and suppressed through k-anonymity, and how temporal and free-text fields can be handled within the same workflow.
Part I presents a real-data Chennai masking analysis
based on the published snatching location choice study. The underlying
script and output files are located in
scripts/chennai_alternative_masking_analysis.R and
outputs/chennai_masking_alternatives/. Because this is a
real-data setting, the correct benchmark is the unmasked
baseline model, not a set of known true parameters.
How Part I should be read: The masking methods are
compared against the unmasked Chennai baseline of the same
specification. Lower RMSE means the masked model remains closer to the
original fitted result. In the main Chennai specification, both crime
points and offender-home points are masked, crime wards are reassigned
after masking, and prior spatial familiarity is measured with
prior_crime_within_1km_any rather than a same-ward
indicator.
The Chennai analysis uses the real snatching data and a specification
adapted from the published Model 2 structure, as implemented in
scripts/chennai_alternative_masking_analysis.R: distance,
prior_crime_within_1km_any, area, population, and the
ward-level opportunity covariates. The 1 km familiarity measure is used
instead of the original same-ward prior-crime indicator so that the
model is less dependent on arbitrary administrative boundaries once
crime and home points are spatially masked. The baseline model is the
unmasked conditional logit model of that same specification. Alternative
grid and geomasking methods are then compared against that baseline. The
summary table below ranks the tested methods by overall deviation from
the unmasked result.
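The two families of masking methods can be sketched as follows. This is a minimal Python illustration, not the code in scripts/chennai_alternative_masking_analysis.R. It assumes that "Grid" means snapping a projected point to the centre of its grid cell and that "Geomask 50-300m" means a donut-style displacement by a uniformly drawn distance between the two radii in a random direction. The function names and the coordinates are hypothetical.

```python
import math
import random


def grid_mask(x, y, cell_size):
    """Snap a projected point (metres) to the centre of its grid cell."""
    cx = (math.floor(x / cell_size) + 0.5) * cell_size
    cy = (math.floor(y / cell_size) + 0.5) * cell_size
    return cx, cy


def donut_geomask(x, y, r_min, r_max, rng):
    """Move a point by a random distance in [r_min, r_max] in a random direction.

    Assumption: the distance is drawn uniformly between the two radii.
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dist = rng.uniform(r_min, r_max)
    return x + dist * math.cos(angle), y + dist * math.sin(angle)


rng = random.Random(42)        # fixed seed so the sketch is reproducible
crime_xy = (1234.0, 5678.0)    # hypothetical projected coordinates in metres

grid_xy = grid_mask(*crime_xy, cell_size=250.0)
donut_xy = donut_geomask(*crime_xy, r_min=50.0, r_max=300.0, rng=rng)

grid_shift = math.dist(crime_xy, grid_xy)
donut_shift = math.dist(crime_xy, donut_xy)
print(grid_xy, round(grid_shift, 1), round(donut_shift, 1))
```

Under these assumptions a grid mask moves a point by at most half the cell diagonal (about 177 m for a 250 m cell), whereas the donut mask guarantees a minimum displacement. That difference is one plausible reason why the two families perform differently at similar nominal distances.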
This comparison is also relevant from the perspective of spatial units of analysis. A substantial literature on the modifiable areal unit problem shows that statistical relationships can change when phenomena are aggregated into different spatial units, especially when administrative boundaries do not align well with the behavioral processes under study (Fotheringham & Wong, 1991; Openshaw, 1984). In crime analysis specifically, variation across spatial units is often large enough to alter interpretation, which means that masking should be evaluated not only as a privacy intervention but also as a change to the spatial representation through which offender decision-making is measured (Steenbeek & Weisburd, 2016; Weisburd et al., 2012).
The familiarity term is especially sensitive to this issue. A
same-ward indicator is convenient when prior offending is recorded by
administrative unit, but it can also treat a boundary crossing as a
substantive change in offender familiarity even when the masked event
remains geographically close to the original location. A boundary-free
formulation such as prior_crime_within_1km_any is therefore
more robust to small positional shifts and more consistent with
arguments in spatial criminology that offenders' awareness spaces and
relevant opportunity structures do not necessarily coincide with
administrative borders (Bernasco & Nieuwbeerta, 2005; Song et al.,
2017).
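The difference between the two familiarity formulations can be illustrated with a toy case. The Python sketch below is an assumption-laden illustration: the report does not state how the 1 km distance is anchored, so the candidate reference point, the two-ward layout, and all function names are hypothetical.

```python
import math


def prior_crime_within_radius(candidate_xy, prior_crime_xys, radius_m=1000.0):
    """Return 1 if any prior crime lies within radius_m of the candidate point."""
    return int(any(math.dist(candidate_xy, p) <= radius_m for p in prior_crime_xys))


def same_ward_prior(candidate_ward, prior_crime_wards):
    """Return 1 if any prior crime was recorded in the candidate ward."""
    return int(candidate_ward in prior_crime_wards)


def ward_of(xy, boundary_x=1000.0):
    """Toy ward lookup: two wards split by a vertical boundary at x = boundary_x."""
    return "A" if xy[0] < boundary_x else "B"


candidate_xy = (900.0, 500.0)     # hypothetical reference point, lies in ward A
prior_original = (980.0, 500.0)   # prior crime, 20 m inside ward A
prior_masked = (1060.0, 500.0)    # same crime after an 80 m shift, now in ward B

ward_before = same_ward_prior(ward_of(candidate_xy), {ward_of(prior_original)})
ward_after = same_ward_prior(ward_of(candidate_xy), {ward_of(prior_masked)})
radius_before = prior_crime_within_radius(candidate_xy, [prior_original])
radius_after = prior_crime_within_radius(candidate_xy, [prior_masked])

print(ward_before, ward_after, radius_before, radius_after)
```

In this toy case an 80 m displacement flips the same-ward indicator from 1 to 0, while the 1 km indicator is unchanged.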
Method | Interpretation | RMSE vs baseline | Max abs. bias | Worst term | Crime point stays in same ward (%) | Home point stays in same ward (%) |
|---|---|---|---|---|---|---|
Grid 250m | Best preservation of the unmasked baseline | 0.017 | 0.038 | Marriage halls (# 10) | 86.5 | 87.2 |
Geomask 50-300m | Very close to the unmasked baseline | 0.028 | 0.075 | Mosques (# 10) | 79.0 | 79.3 |
Grid 500m | Usable, but with some extra distortion | 0.040 | 0.124 | Any prior crime within 1 km (0,1) | 78.7 | 79.1 |
Geomask 200-600m | More noticeable distortion | 0.084 | 0.336 | Any prior crime within 1 km (0,1) | 59.0 | 59.2 |
Grid 1000m | More noticeable distortion | 0.091 | 0.224 | Mosques (# 10) | 58.1 | 62.6 |
Geomask 400-1200m | More noticeable distortion | 0.114 | 0.496 | Any prior crime within 1 km (0,1) | 34.5 | 32.0 |
Grid 250m is the best-performing masking method in this
analysis (RMSE 0.017), followed by Geomask 50-300m (0.028) and
Grid 500m (0.040). The wider masking settings produce more
noticeable distortion (RMSE 0.084 to 0.114), but the main
substantive pattern of the model remains recognizable.
The next table compares the unmasked baseline directly against the best-performing masking method. It reports odds ratios, p-values, and the percentage change in the odds ratio. This is the most direct way to judge whether masking changes the substantive interpretation of the Chennai model under the boundary-free familiarity specification.
Term | Baseline OR | Baseline p | Best model | Best-model OR | Best-model p | OR difference | % change from baseline | Direction changed | Significance changed |
|---|---|---|---|---|---|---|---|---|---|
Distance (km) | 0.356 | 0.000 | Grid 250m | 0.350 | 0.000 | -0.007 | -1.832 | FALSE | FALSE |
Any prior crime within 1 km (0,1) | 10.018 | 0.000 | Grid 250m | 9.741 | 0.000 | -0.277 | -2.767 | FALSE | FALSE |
Area (km2) | 1.075 | 0.000 | Grid 250m | 1.092 | 0.000 | 0.017 | 1.576 | FALSE | FALSE |
Population (# 1000) | 1.004 | 0.208 | Grid 250m | 1.003 | 0.403 | -0.001 | -0.132 | FALSE | FALSE |
Retail stores (# 10) | 1.010 | 0.358 | Grid 250m | 1.013 | 0.261 | 0.002 | 0.228 | FALSE | FALSE |
Transit stations (# 10) | 1.039 | 0.432 | Grid 250m | 1.031 | 0.524 | -0.008 | -0.727 | FALSE | FALSE |
Mosques (# 10) | 1.073 | 0.450 | Grid 250m | 1.093 | 0.341 | 0.020 | 1.859 | FALSE | FALSE |
Temples (# 10) | 1.009 | 0.760 | Grid 250m | 0.980 | 0.494 | -0.029 | -2.916 | TRUE | FALSE |
Churches (# 10) | 1.182 | 0.000 | Grid 250m | 1.161 | 0.000 | -0.021 | -1.757 | FALSE | FALSE |
Education institutions (# 10) | 1.072 | 0.000 | Grid 250m | 1.066 | 0.000 | -0.005 | -0.507 | FALSE | FALSE |
School and college (# 10) | 0.989 | 0.669 | Grid 250m | 0.994 | 0.814 | 0.005 | 0.495 | FALSE | FALSE |
Personal care (# 10) | 1.066 | 0.003 | Grid 250m | 1.052 | 0.021 | -0.015 | -1.379 | FALSE | FALSE |
Hospitals (# 10) | 0.999 | 0.950 | Grid 250m | 1.000 | 0.994 | 0.001 | 0.113 | TRUE | FALSE |
Marriage halls (# 10) | 1.123 | 0.004 | Grid 250m | 1.168 | 0.000 | 0.044 | 3.921 | FALSE | FALSE |
Jewelleries (# 10) | 1.025 | 0.163 | Grid 250m | 1.035 | 0.045 | 0.010 | 1.003 | FALSE | TRUE |
Textiles (# 10) | 0.987 | 0.128 | Grid 250m | 0.984 | 0.058 | -0.003 | -0.331 | FALSE | FALSE |
Park (# 10) | 1.176 | 0.017 | Grid 250m | 1.164 | 0.024 | -0.012 | -1.018 | FALSE | FALSE |
Recreation facilities (# 10) | 0.957 | 0.124 | Grid 250m | 0.984 | 0.556 | 0.026 | 2.743 | FALSE | FALSE |
Restaurant (# 10) | 1.028 | 0.033 | Grid 250m | 1.035 | 0.009 | 0.006 | 0.612 | FALSE | FALSE |
Government office (# 10) | 1.058 | 0.013 | Grid 250m | 1.062 | 0.007 | 0.004 | 0.375 | FALSE | FALSE |
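The main findings remain stable under the best masking method. The
distance effect remains negative and strong, and the
prior_crime_within_1km_any effect remains strongly
positive. The flagged changes are minor: the Temples and Hospitals odds
ratios cross 1 but are non-significant in both models, and Jewelleries
moves from p = 0.163 to p = 0.045. The core criminological
interpretation therefore does not reverse under the best-performing
masking specification.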
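The derived columns of the comparison table can be reproduced with a few lines. The Python sketch below assumes that a direction change means the odds ratio crossing 1 and that a significance change means the p-value crossing 0.05. The threshold is not stated in the report, but it is consistent with the flags shown. The function name is hypothetical, and the inputs are the rounded table values.

```python
def compare_term(base_or, base_p, masked_or, masked_p, alpha=0.05):
    """Compare one term between the baseline and a masked model."""
    return {
        "or_difference": masked_or - base_or,
        "pct_change": 100.0 * (masked_or - base_or) / base_or,
        # Direction changes when the odds ratio moves to the other side of 1.
        "direction_changed": (base_or > 1.0) != (masked_or > 1.0),
        # Significance changes when the p-value crosses the assumed threshold.
        "significance_changed": (base_p < alpha) != (masked_p < alpha),
    }


prior = compare_term(10.018, 0.000, 9.741, 0.000)      # Any prior crime within 1 km
temples = compare_term(1.009, 0.760, 0.980, 0.494)     # Temples
jewellery = compare_term(1.025, 0.163, 1.035, 0.045)   # Jewelleries

for name, result in [("prior", prior), ("temples", temples), ("jewellery", jewellery)]:
    print(name, round(result["pct_change"], 3),
          result["direction_changed"], result["significance_changed"])
```

Because the table values are rounded to three decimals, a borderline case such as Hospitals (0.999 versus 1.000) cannot be reproduced from the printed figures; the flags in the report were presumably computed on unrounded estimates.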
The first figure compares odds ratios across the masking scenarios. The dashed line in each panel marks the unmasked baseline estimate. The second figure ranks the methods by overall coefficient deviation from the baseline.
Figure 1: Chennai real-data odds ratio comparison across masking methods. The dashed reference line in each panel marks the unmasked baseline estimate for that term.
Figure 2: Chennai real-data masking methods ranked by RMSE of coefficient deviation from the unmasked baseline. Lower values indicate better preservation of the original model.
For the Chennai real-data application, the practical question is not
which masking method is universally best, but which method preserves the
original analytical result most closely. In this run, the smaller
masking settings perform best. Grid 250m produces the
lowest overall deviation from the unmasked model, with
Geomask 50-300m also remaining close. The wider geomasking
and coarser grid settings introduce more noticeable coefficient drift,
especially for the familiarity term when points cross wards more
often.
This should therefore be read as a baseline-deviation study. The goal is to test how far the masked model moves away from the original fitted Chennai result once sensitive crime and home locations are transformed. Because the benchmark is the unmasked model rather than true population parameters, the interpretation is straightforward: smaller RMSE means better analytical preservation.
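The deviation measure can be made concrete with the baseline versus Grid 250m table above. The Python sketch below assumes that the deviation is taken on the coefficient (log odds ratio) scale, with equal weight on all 20 terms; the R script may define it differently. The odds ratios are copied from the table and are rounded, so small differences from the reported 0.017 and 0.038 are expected.

```python
import math

# (baseline OR, Grid 250m OR), copied from the comparison table (rounded values)
odds_ratios = {
    "Distance (km)": (0.356, 0.350),
    "Any prior crime within 1 km (0,1)": (10.018, 9.741),
    "Area (km2)": (1.075, 1.092),
    "Population (# 1000)": (1.004, 1.003),
    "Retail stores (# 10)": (1.010, 1.013),
    "Transit stations (# 10)": (1.039, 1.031),
    "Mosques (# 10)": (1.073, 1.093),
    "Temples (# 10)": (1.009, 0.980),
    "Churches (# 10)": (1.182, 1.161),
    "Education institutions (# 10)": (1.072, 1.066),
    "School and college (# 10)": (0.989, 0.994),
    "Personal care (# 10)": (1.066, 1.052),
    "Hospitals (# 10)": (0.999, 1.000),
    "Marriage halls (# 10)": (1.123, 1.168),
    "Jewelleries (# 10)": (1.025, 1.035),
    "Textiles (# 10)": (0.987, 0.984),
    "Park (# 10)": (1.176, 1.164),
    "Recreation facilities (# 10)": (0.957, 0.984),
    "Restaurant (# 10)": (1.028, 1.035),
    "Government office (# 10)": (1.058, 1.062),
}

# Deviation of each masked coefficient from the baseline coefficient.
deviation = {
    term: math.log(masked) - math.log(base)
    for term, (base, masked) in odds_ratios.items()
}

rmse = math.sqrt(sum(d * d for d in deviation.values()) / len(deviation))
worst_term = max(deviation, key=lambda term: abs(deviation[term]))
max_abs_bias = abs(deviation[worst_term])

print(f"RMSE = {rmse:.3f}, max abs. bias = {max_abs_bias:.3f}, worst term = {worst_term}")
```

On these rounded inputs the sketch gives an RMSE close to the reported value and identifies the same worst term as the summary table, which suggests the log-scale reading is at least compatible with the published figures.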
The Chennai analysis itself is run outside this report. The table below lists the output files currently consumed by Part I.
File | Folder | Size (KB) | Description |
|---|---|---|---|
Baseline Coefficients | prior within 1km | 1.4 | Baseline coefficient estimates from the unmasked Chennai model |
Baseline Compare Table | prior within 1km | 1.0 | Paper-style baseline odds-ratio comparison file |
Baseline Odds Ratios | prior within 1km | 1.8 | Baseline odds ratios with confidence intervals |
Baseline Vs Best Model | prior within 1km | 2.1 | Reader-friendly baseline vs best-method comparison |
Bias Table | prior within 1km | 11.5 | Coefficient-level bias table across all masking methods |
Coefficient Comparison | prior within 1km | 462.0 | Odds-ratio comparison plot across masking methods |
Deviation Table | prior within 1km | 14.3 | Detailed deviation of each coefficient from baseline |
Easy Method Summary Table | prior within 1km | 1.0 | Plain-language method ranking summary |
Model Coefficients | prior within 1km | 9.0 | All masked and baseline coefficient estimates |
Model Odds Ratios | prior within 1km | 11.4 | All masked and baseline odds ratios |
RMSE Bias Comparison | prior within 1km | 73.3 | RMSE comparison plot across masking methods |
RMSE Summary | prior within 1km | 0.4 | Method ranking by RMSE and max absolute bias |
Ward Shift Summary | prior within 1km | 0.2 | Ward stability summary after masking |
As a robustness check, I also estimated a reduced specification that drops the prior-location term entirely and re-runs the masking comparison on that reduced model. This is not used as the main specification, because it removes an important spatial familiarity mechanism. It is retained as an appendix-style check to show that the main conclusions do not depend entirely on the 1 km familiarity adaptation.
Robustness specification | Best masking method | Best RMSE | Crime point stays in same ward (%) | Home point stays in same ward (%) | Note |
|---|---|---|---|---|---|
Reduced model (drop prior-crime term) | Geomask 50-300m | 0.015 | 79.8 | 80.0 | Used as appendix robustness check because it drops the prior-location mechanism. |
In that reduced model, the best masking method was Geomask
50-300m with an RMSE of about 0.015, slightly lower than the
0.017 obtained for Grid 250m in the main 1 km familiarity model.
However, because it drops the prior-location mechanism altogether, it is
treated as a robustness check rather than the preferred substantive
specification.
Part I is based on the output set in
outputs/chennai_masking_alternatives/prior_within_1km/. The
dropped-prior specification in
outputs/chennai_masking_alternatives/reduced_no_prior/ is
included only as an appendix robustness check.
This section demonstrates how the anonymization algorithm handles personal identifiers -- PV numbers, RRN (Rijksregisternummer), and person attributes -- so that multiple police datasets can still be linked by researchers after anonymization, without exposing identity.
Cross-zone standardisation - a core motivation for algorithmic anonymization: Belgian police data is produced by approximately 187 local police zones, each with its own database infrastructure and extraction workflow. Without an algorithmic approach, zones can anonymize data inconsistently: one zone may redact birth year while another retains it; one may suppress nationality while another codes it differently. Research aggregated across zones then carries systematic measurement error that is invisible to the researcher. The consistent keyed pseudonymization demonstrated in this section addresses the identifier side of this problem: provided every zone applies the same secret key and the same identifier normalisation, HMAC(key, RRN) produces the same pseudonym for the same person regardless of which zone extracted the record. Cross-zone identity linkage then becomes reliable, which is a precondition for valid cross-zone statistical comparisons, and no personal data has to be exchanged between zones before pseudonymization.
Publicly documented Belgian criminal-justice data environments extend beyond a single police register. They include local police operational systems such as ISLP/ISLP2, mobile and search layers such as FOCUS@GPI and PoliceSearch, national police reference systems centred on the ANG/BNG and the publicly referenced FEEDIS feed environment, justice systems such as JustCase and JustMask, prison systems such as Sidis Suite, and NICC research or forensic infrastructures such as DOT and be.care. The exact internal schemas of these systems are not publicly documented in full, but the linkage problem is clear: the same person, case, or event may reappear across operational silos under different local formatting conventions and in different data modalities.
These environments combine structured operational records, free-text narratives and legal documents, and temporal and geospatial event data. For that reason, the anonymization problem addressed here is not limited to removing direct identifiers from one table. A practical pipeline must standardize identifiers before transformation, apply deterministic pseudonymization to person and case keys, handle text redaction as a separate task, and protect spatial or temporal fields separately where those fields create re-identification risk. The synthetic ISLP/TAS/SRS2-style inputs used below are therefore best understood as a simplified stand-in for a broader multi-system linkage problem rather than as a claim to reproduce every operational database in full.
Core challenge in Belgian police data: Police information is stored across multiple systems such as local operational police registers, offender and victim files, national police reference systems, judicial case-management environments, prison systems, and forensic research infrastructures. Each record may carry a PV number (Proces-Verbaal) or another case identifier, together with person identifiers that allow the same individual to be linked across events and institutional contexts.
When data are prepared for research, removing names is not sufficient. PV numbers, national registry numbers (Rijksregisternummer/RRN), and date of birth can still support re-identification when they are combined.
The method demonstrated here uses consistent keyed pseudonymization so that:
- The same person in Dataset A and Dataset B gets the same anonymized ID
- Researchers can still link records across time and datasets
- Re-identification is possible only for the data controller, who holds the secret key
The raw synthetic police-style inputs used below were generated once
by scripts/generate_synthetic_demo_data.R and are loaded
from data_generated/. This separation was used to keep the
report focused on anonymization, pseudonymization, linkage, and output
interpretation rather than on repeated data creation during each
render.
The table below shows the structure of the raw synthetic incident register before pseudonymization. The key analytical fields are the PV number, date, offence type, neighbourhood, and injury/outcome variables.
pv_number | datum | tijd | delict_type | wijk | letsel | status |
|---|---|---|---|---|---|---|
2022/GNT/05041 | 2023-08-04 | 10:15 | Property crime | Wondelgem | Geen | Gesloten |
2024/GNT/03266 | 2022-02-18 | 19:00 | Public order / substance | Sint-Amandsberg | Geen | Gesloten |
2022/GNT/05326 | 2022-03-12 | 23:30 | Property crime | Mariakerke | Geen | Gesloten |
2022/GNT/09504 | 2024-07-07 | 19:45 | Public order / substance | Mariakerke | Geen | Doorverwezen |
Some offenders committed multiple crimes (repeat offenders - same person_id in multiple rows).
The next table highlights why deterministic pseudonymization matters: some individuals appear in multiple PV records, so the research version must preserve within-person linkage without exposing identity.
Some victims appear in multiple incidents. Critically, some persons are BOTH offender and victim in different cases - a real pattern in interpersonal violence data.
A naive approach strips direct identifiers (name, RRN, address) and replaces each identifier with a new random ID every time. This is the most common mistake in ad-hoc anonymization.
Why this fails: Random replacement generates one code for
2022/GNT/00341 in the offender file and a completely different code in the victim file. Researchers can no longer join on PV number, so the data become useless for cross-file analysis.
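The failure is easy to reproduce. In the Python sketch below, the function name and the two toy files are hypothetical; only the PV number format follows the examples in this report.

```python
import secrets


def naive_random_ids(values):
    """Replace every value with a fresh random token, independently per file."""
    return [secrets.token_hex(6).upper() for _ in values]


# Toy extracts: PV 2022/GNT/00341 appears in both files.
offender_pv = ["2022/GNT/00341", "2022/GNT/05041"]
victim_pv = ["2022/GNT/00341", "2024/GNT/03266"]

offender_ids = naive_random_ids(offender_pv)
victim_ids = naive_random_ids(victim_pv)

raw_matches = set(offender_pv) & set(victim_pv)
naive_matches = set(offender_ids) & set(victim_ids)

# One shared PV number in the raw files; after random replacement the
# tokens are unrelated, so the join finds nothing (barring a chance collision).
print(len(raw_matches), len(naive_matches))
```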
We use HMAC-SHA256, which can be understood here as a standard keyed hashing method. In practice, it turns an identifier such as an RRN or PV number into a stable pseudonym using a secret key. The same input and the same key always produce the same pseudonym; without the key, the original identifier cannot be read back directly from the output.
This is the approach recommended by ENISA (ENISA, 2019) and compatible with GDPR Art. 4(5) pseudonymization.
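A minimal Python sketch of this step is shown below, using only the standard library. Several details are assumptions rather than the report's implementation: the normalisation rule, the truncation to 12 hexadecimal characters (chosen to match the pseudonym format shown in the tables), the helper names, and the demo key. In production the key would come from the controller's key vault, never from source code.

```python
import hashlib
import hmac
import re


def normalise_id(raw):
    """Uppercase and keep only letters and digits, so that formatting
    variants such as '86.04.12-234.71' and '86041223471' hash identically."""
    return re.sub(r"[^0-9A-Z]", "", raw.upper())


def pseudonym(raw, key, prefix, n_hex=12):
    """Derive a stable pseudonym with HMAC-SHA256 and a secret key."""
    digest = hmac.new(key, normalise_id(raw).encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:n_hex].upper()}"


demo_key = b"demo-key-not-for-production"   # hypothetical key, for illustration only

a = pseudonym("86.04.12-234.71", demo_key, "PRS")   # synthetic RRN with punctuation
b = pseudonym("86041223471", demo_key, "PRS")       # same RRN, digits only
c = pseudonym("86.04.12-234.71", b"another-key", "PRS")

print(a == b)   # same person, same key: identical pseudonym
print(a == c)   # same person, different key: different pseudonym
```

Normalising before hashing matters as much as the key itself: without it, two zones that format the same RRN differently would produce two unrelated pseudonyms.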
The researcher-facing files shown below retain analytical fields and stable pseudonyms, but remove direct identifiers such as names, RRN, and raw PV numbers.
pv_pseudo | person_pseudo | person_pseudo_univ | geslacht | nationaliteit | burgelijke_staat | leeftijdsgroep | rol | gekend_bij_pz |
|---|---|---|---|---|---|---|---|---|
PV-841A5F319D8B | D-F23B9BA25B6E | PRS-F23B9BA25B6E | V | Duits | Samenwonend | | Verdachte | Nee |
PV-85A517390E8F | D-523BA24A1825 | PRS-523BA24A1825 | M | Congolees | Samenwonend | | Verdachte | Ja |
PV-7E79B32DE9FE | D-1895C0DB1C22 | PRS-1895C0DB1C22 | M | Roemeens | Samenwonend | | Verdachte | Nee |
PV-01FB2838F77F | D-80C0220A9292 | PRS-80C0220A9292 | V | Nederlands | Gescheiden | | Verdachte | Nee |
PV-CE1B202E69F3 | D-AC4CC429F7F3 | PRS-AC4CC429F7F3 | V | Belgisch | Gehuwd | | Verdachte | Ja |
pv_pseudo | person_pseudo | person_pseudo_univ | geslacht | nationaliteit | burgelijke_staat | leeftijdsgroep | rol | relatie_dader |
|---|---|---|---|---|---|---|---|---|
PV-BDD76BBF3F1C | S-F71BABA98A2E | PRS-F71BABA98A2E | M | Duits | Weduwe/Weduwnaar | | Slachtoffer | Kennis |
PV-C19BF58E43D7 | S-ACF088B1C241 | PRS-ACF088B1C241 | M | Pools | Gehuwd | | Slachtoffer | Onbekend |
PV-661E861336D7 | S-AB252BC3C6E5 | PRS-AB252BC3C6E5 | M | Frans | Gehuwd | | Slachtoffer | Collega |
PV-B87EB997C714 | S-7DF963006DA7 | PRS-7DF963006DA7 | M | Duits | Ongehuwd | | Slachtoffer | Familielid |
PV-7053CD96B401 | S-3D6FEB2A6CA2 | PRS-3D6FEB2A6CA2 | M | Duits | Gescheiden | | Slachtoffer | Kennis |
The core claim of consistent keyed pseudonymization is that all analytical joins work identically before and after anonymization. The scorecard below verifies this explicitly.
Join type | Raw data | After pseudonymization | Preserved? |
|---|---|---|---|
Incidents row count | 120 | 120 | ✓ 100% |
Unique PV numbers (incidents) | 120 | 120 | ✓ 100% |
Unique offender persons (by RRN/pseudonym) | 42 | 42 | ✓ 100% |
Offender → Incident joins (PV number) | 85 | 85 | ✓ 100% |
Offender → Victim joins (same person) | 4 | 4 | ✓ 100% |
How to read this: The Raw data column counts joins using real identifiers (RRN, PV number). The After pseudonymization column counts the same joins using pseudonyms. Identical counts show that, in this synthetic example and with a consistent key, the intended links are preserved after direct identifiers are transformed. This does not mean the release is risk-free: quasi-identifier risk remains and is evaluated in the next sections.
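The logic of such a scorecard can be sketched in Python as follows. The toy files, the key, and the function names are hypothetical, and the overlap is counted here at person level, which may differ from how the report's own script counts offender-victim joins.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"   # hypothetical key, for illustration only


def pseudo(value, prefix):
    """Keyed pseudonym, truncated to 12 hex characters (assumed format)."""
    digest = hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:12].upper()}"


def count_joins(incident_keys, offender_rows, victim_rows):
    """Count the joins that the research files are expected to support."""
    incident_set = set(incident_keys)
    offender_persons = {person for _, person in offender_rows}
    victim_persons = {person for _, person in victim_rows}
    return {
        "unique_offenders": len(offender_persons),
        "offender_incident_joins": sum(1 for pv, _ in offender_rows if pv in incident_set),
        "offender_victim_overlap": len(offender_persons & victim_persons),
    }


# Toy raw files: rows are (pv_number, rrn).
incidents = ["PV1", "PV2", "PV3"]
offenders = [("PV1", "RRN-A"), ("PV2", "RRN-B"), ("PV3", "RRN-A")]
victims = [("PV2", "RRN-A"), ("PV3", "RRN-C")]

raw_counts = count_joins(incidents, offenders, victims)
pseudo_counts = count_joins(
    [pseudo(pv, "PV") for pv in incidents],
    [(pseudo(pv, "PV"), pseudo(rrn, "PRS")) for pv, rrn in offenders],
    [(pseudo(pv, "PV"), pseudo(rrn, "PRS")) for pv, rrn in victims],
)

print(raw_counts)
print(pseudo_counts == raw_counts)
```

Running the same counting function on both versions is the point of the design: any difference between the two dictionaries signals a key or formatting inconsistency rather than a change in the data.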
PV pseudonym | Person pseudonym | Gender | Age group | Nationality | Prior PVs | Known to police | Date | Offence type | Neighbourhood | Injury |
|---|---|---|---|---|---|---|---|---|---|---|
PV-841A5F319D8B | PRS-F23B9BA25B6E | V | | Duits | 0 | Nee | 2021-07-01 | Property crime | Muide | Zwaar |
PV-85A517390E8F | PRS-523BA24A1825 | M | | Congolees | 1 | Ja | 2022-07-28 | Public order / substance | Sint-Amandsberg | Geen |
PV-7E79B32DE9FE | PRS-1895C0DB1C22 | M | | Roemeens | 0 | Nee | 2022-02-18 | Public order / substance | Sint-Amandsberg | Geen |
PV-01FB2838F77F | PRS-80C0220A9292 | V | | Nederlands | 0 | Nee | 2023-11-21 | Public order / substance | Bloemekenswijk | Licht |
PV-CE1B202E69F3 | PRS-AC4CC429F7F3 | V | | Belgisch | 1 | Ja | 2021-08-31 | Property crime | Muide | Geen |
PV-B87EB997C714 | PRS-4700C7A200C8 | M | | Congolees | 1 | Nee | 2022-12-23 | Public order / substance | Sint-Amandsberg | Licht |
This is the most sensitive use case: a person who was an offender in
one case and a victim in another. Using the universal person
pseudonym (person_pseudo_univ), researchers can
trace this without ever knowing who the person is.
Person pseudonym | Offender PV | Role (offender) | Victim PV | Role (victim) | Relation to suspect |
|---|---|---|---|---|---|
PRS-64439954BA3B | PV-8CAE09F42E9D | Verdachte | PV-ED87DE7C95D7 | Slachtoffer | Onbekend |
PRS-D2811785D27F | PV-EF8DB40B5816 | Verdachte | PV-24464F55C289 | Slachtoffer | Onbekend |
PRS-D2811785D27F | PV-EF8DB40B5816 | Verdachte | PV-23FFB7E4AD3D | Slachtoffer | Onbekend |
PRS-1E4DC75EB6FF | PV-6DFFC4C79251 | Verdachte | PV-21358D6B89AD | Slachtoffer | Partner |
Person pseudonym | PV pseudonym | Date | Offence type | Neighbourhood | Gender | Age group | Nationality | Prior PVs |
|---|---|---|---|---|---|---|---|---|
PRS-12DC66B62F83 | PV-795D706A47D6 | 2021-05-23 | Violence / interpersonal | Muide | M | | Belgisch | 0 |
PRS-12DC66B62F83 | PV-661E861336D7 | 2021-08-01 | Violence / interpersonal | Sint-Amandsberg | M | | Belgisch | 3 |
PRS-12DC66B62F83 | PV-CC570D1B8B80 | 2021-08-02 | Violence / interpersonal | Gentbrugge | M | | Belgisch | 3 |
PRS-12DC66B62F83 | PV-E871AF0588D2 | 2023-08-19 | Violence / interpersonal | Mariakerke | M | | Belgisch | 1 |
PRS-12DC66B62F83 | PV-7B70B46C910C | 2023-09-02 | Property crime | Ledeberg | M | | Belgisch | 1 |
PRS-12DC66B62F83 | PV-A9921AD699D6 | 2023-11-13 | Public order / substance | Gentbrugge | M | | Belgisch | 3 |
PRS-12DC66B62F83 | PV-9A654E2DCB93 | 2024-04-04 | Property crime | Gentbrugge | M | | Belgisch | 2 |
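Even after direct identifiers are removed, quasi-identifiers can make individuals unique: the combination of age group, nationality, marital status, and gender may single out one person even though no single attribute does. Sweeney (2002) reports the classic illustration, in which an estimated 87% of the U.S. population was uniquely identifiable from ZIP code, date of birth, and sex.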
Dataset | Records | Unique combos | % unique | Small group (n) | % small group |
|---|---|---|---|---|---|
Offenders (raw) | 85 | 9 | 30% | 39 | 46% |
Victims (raw) | 110 | 9 | 21% | 63 | 57% |
Dataset | Records | Unique combos | % unique | Small group (n) | % small group |
|---|---|---|---|---|---|
Offenders (pseudonymized) | 85 | 9 | 30% | 39 | 46% |
Victims (pseudonymized) | 110 | 9 | 21% | 63 | 57% |
These two summary tables show that pseudonymization removes direct identifiers, but does not on its own eliminate all uniqueness risk in quasi-identifier combinations. That is why the k-anonymity check below is still needed.
Console output from the k-anonymity step:
- No rows in ka_check; the corresponding plot was skipped.
- Offenders: nationality generalised in 55 of 85 records; age group suppressed in 0 of 85.
- Victims: nationality generalised in 83 of 110 records; age group suppressed in 0 of 110.
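A check of this kind can be sketched in Python as follows. The threshold k = 5, the replacement label "Andere", the toy records, and the function names are all assumptions made for illustration; the report does not state the values used in its own run. The column names follow the researcher-facing tables.

```python
from collections import Counter

QI = ("geslacht", "leeftijdsgroep", "nationaliteit", "burgelijke_staat")
K = 5   # hypothetical k-anonymity threshold


def qi_key(row):
    """Quasi-identifier combination of one record."""
    return tuple(row[col] for col in QI)


def small_group_records(rows, k=K):
    """Number of records that fall in a QI group smaller than k."""
    sizes = Counter(qi_key(r) for r in rows)
    return sum(n for n in sizes.values() if n < k)


def generalise_nationality(rows, k=K, other="Andere"):
    """Replace nationality by a broad label for records in groups smaller than k."""
    sizes = Counter(qi_key(r) for r in rows)
    out, changed = [], 0
    for r in rows:
        if sizes[qi_key(r)] < k:
            r = {**r, "nationaliteit": other}
            changed += 1
        out.append(r)
    return out, changed


def make(n, geslacht, nationaliteit):
    """Build n identical toy records."""
    return [
        {"geslacht": geslacht, "leeftijdsgroep": "25-34",
         "nationaliteit": nationaliteit, "burgelijke_staat": "Gehuwd"}
        for _ in range(n)
    ]


rows = make(6, "V", "Belgisch") + make(3, "M", "Duits") + make(2, "M", "Pools")

before = small_group_records(rows)
generalised, n_changed = generalise_nationality(rows)
after = small_group_records(generalised)

print(f"nationality-generalised records: {n_changed} of {len(rows)}")
print(f"records in groups smaller than k: {before} before, {after} after")
```

A single generalisation pass does not guarantee k-anonymity in general. Groups that remain below k after generalisation would still need suppression or further coarsening, which is why the check has to be rerun on the transformed file.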
This is the complete pipeline a researcher receives. They have no access to real identifiers - only pseudonymous IDs and generalised attributes - yet they can perform full longitudinal and cross-dataset analysis.
Figure 11: Offence type by age group derived from the pseudonymized researcher dataset (Part II). This chart demonstrates that cross-variable analysis remains fully possible: offender age groups (from pseudonymized records) are linked to incident crime types via the PV pseudonym, without any direct identifier being present in the data.
Figure 12: Crime type by neighbourhood heatmap derived from the pseudonymized researcher dataset (Part II). Full analytical detail is preserved: no direct identifiers remain, yet cross-variable tabulation across neighbourhoods and offence types is unimpeded.
This section keeps only the additional release checks that are not already visible in the linkage tables above. The core point is straightforward: deterministic pseudonymization preserves joins, but a research release still needs decisions about timestamp precision and free-text redaction before it can be shared safely.
Exact timestamps can become identifying when they are combined with offence category, neighbourhood, or person-level attributes. For that reason, the practical question is not whether time should be dropped entirely, but how far it should be coarsened before release.
Precision level | Unique combinations | Total records | % unique | Re-id risk |
|---|---|---|---|---|
Exact (date + HH:MM) | 119 | 120 | 99.2 | High |
Date + hour of day (24) | 118 | 120 | 98.3 | High |
Date + time band (4) | 117 | 120 | 97.5 | High |
Week + time band | 108 | 120 | 90.0 | High |
Month + time band | 87 | 120 | 72.5 | High |
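The coarsening comparison can be sketched in Python as below. The sketch reads "unique combinations" as the number of distinct time values at each precision level and defines the four time bands as 6-hour blocks. Both are assumptions, as are the toy timestamps; the report's table may also combine time with other fields.

```python
from datetime import datetime


def time_band(hour):
    """Four 6-hour bands: 0 = night, 1 = morning, 2 = afternoon, 3 = evening (assumed)."""
    return hour // 6


# Each level maps a timestamp to the value that would be released.
LEVELS = {
    "exact": lambda t: (t.date(), t.hour, t.minute),
    "date_hour": lambda t: (t.date(), t.hour),
    "date_band": lambda t: (t.date(), time_band(t.hour)),
    "month_band": lambda t: (t.year, t.month, time_band(t.hour)),
}


def share_distinct(timestamps, key):
    """Percentage of distinct released values relative to the number of records."""
    distinct = {key(t) for t in timestamps}
    return 100.0 * len(distinct) / len(timestamps)


stamps = [                      # toy incident timestamps
    datetime(2022, 3, 12, 23, 30),
    datetime(2022, 3, 12, 23, 45),
    datetime(2022, 3, 12, 19, 0),
    datetime(2022, 3, 20, 22, 15),
    datetime(2023, 8, 4, 10, 15),
]

results = {name: share_distinct(stamps, key) for name, key in LEVELS.items()}
for name, pct in results.items():
    print(f"{name:<11} {pct:5.1f}% distinct")
```

With only 120 incidents spread over several years, as in the table above, even month-level precision leaves most values distinct, so time usually has to be coarsened jointly with other fields rather than on its own.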
Structured pseudonymization is not enough when police records include narrative text. Names, RRNs, addresses, and times often remain embedded in natural language fields and therefore require a separate redaction step before a research extract can be released.
Record ID | Raw narrative | Redacted narrative |
|---|---|---|
PV-A1B2 | Op 15/03/2022 om 23:45 werd Jan De Smedt, RRN 86.04.12-234.71, wonende Langestraat 42, 9000 Gent, aangetroffen. | Op [DATUM] om [TIJD] werd [NAAM], RRN [RRN], wonende [ADRES], [POSTCODE], aangetroffen. |
PV-C3D4 | Het voertuig werd bestuurd door Mohamed El Amrani, geboortedatum 04/04/1990, RRN 90.04.04-123.45, uit Merelbeke. | Het voertuig werd bestuurd door [NAAM], geboortedatum [DATUM], RRN [RRN], uit Merelbeke. |
PV-E5F6 | Slachtoffer Emma Vandenberghe (geb. 12-06-1978), verblijvend te Veldstraat 18, Gent, deed aangifte van diefstal. | [NAAM] (geb. [DATUM]), verblijvend te [ADRES], Gent, deed aangifte van diefstal. |
PV-G7H8 | Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om 02:30. | Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om [TIJD]. |
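Interpretation: The regex example is only a baseline used to illustrate the transformation logic. Its limits are visible in the table: the place names Merelbeke (PV-C3D4) and Gent (PV-E5F6) remain in the redacted text, and in PV-E5F6 the role word Slachtoffer is removed together with the name. In practice, entity detection in free text needs a stronger NLP layer because names, addresses, and times appear in many formats that simple patterns do not capture reliably.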
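A pattern-based baseline of this kind might look like the Python sketch below. The three patterns are assumptions written for this illustration and cover only RRNs, numeric dates, and clock times; they are not the patterns used to produce the table. Names and addresses are deliberately left out because they are the part that fixed patterns handle poorly.

```python
import re

# Order matters: the RRN pattern runs first so that its digits are not
# partially consumed by the date or time patterns.
PATTERNS = [
    (re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b"), "[RRN]"),
    (re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b"), "[DATUM]"),
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "[TIJD]"),
]


def redact(text):
    """Apply each pattern in turn and replace matches with a placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


narrative = ("Op 15/03/2022 om 23:45 werd Jan De Smedt, "
             "RRN 86.04.12-234.71, aangetroffen.")
redacted = redact(narrative)
print(redacted)
```

The synthetic name in the narrative passes through unchanged, which shows concretely why a named-entity recognition step is needed on top of fixed patterns.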
Mechanism | Identifier addressed | Legal basis (GDPR / LED 2016/680 / WPA) |
|---|---|---|
HMAC-SHA256 pseudonymization | PV number, Person ID, RRN | GDPR Art. 4(5); LED Art. 3(5) - pseudonymization; key held by controller |
Age group (not exact DOB) | Date of birth | GDPR Art. 5(1)(c); LED Art. 4(1)(c); WPA Art. 44/1 - data minimisation |
Nationality retained | Nationality | GDPR Art. 9(1); LED Art. 10 - special categories in criminal justice context |
Address removed | Street, house number | GDPR Art. 5(1)(c); LED Art. 4(1)(c); WPA Art. 44/1 - minimum necessary data |
k-Anonymity suppression | QI combinations | GDPR Art. 89(1); LED Art. 4(3) - proportionate technical safeguards for research |
Universal person pseudonym | Cross-dataset identity | GDPR Rec. 26; LED Rec. 21 - re-identification must not be reasonably possible |
Key held separately | De-pseudonymization | GDPR Art. 25; LED Art. 20; WPA Art. 44/7 - data protection by design |
Key point for the UGent Crime Lab project: The consistent pseudonymization key is the technical core of the anonymization algorithm. It must be:
- Generated once per data release by the data controller (local police zone)
- Stored in an HSM or certified key vault, never in the research environment
- Rotated per researcher batch to prevent cross-batch re-identification
- Audited (GDPR Art. 30 processing register)
Wet op het Politieambt (WPA) -- Belgian Police Act: Belgian police data is additionally governed by the Wet op het Politieambt (WPA, Belgisch Staatsblad 22 December 1992, as amended). Art. 44/1 WPA requires that personal data collected in the performance of police duties be accurate, adequate, relevant and not excessive. Art. 44/3 WPA mandates retention periods proportionate to purpose. Art. 44/7 WPA grants data subjects rights of access and correction. The pseudonymization pipeline directly operationalises the WPA data minimisation requirement: only the minimum attributes needed for scientific analysis are transferred to the researcher; all surplus personal data is removed or transformed before the data leaves the police system.
LED 2016/680 -- EU Law Enforcement Directive: Police data held for law enforcement purposes falls under Directive (EU) 2016/680 (transposed in Belgium by the Law of 30 July 2018), not the GDPR directly. Key distinctions from GDPR: lawfulness of processing derives from national law (LED Art. 8, not GDPR Art. 6); research access requires a documented scientific purpose with minimum-necessary data (LED Art. 4(2)); special categories (ethnic origin, health, criminal history) are subject to LED Art. 10. The compliance table above maps each mechanism to both its GDPR analogue and the corresponding LED / WPA article.
The report does not demonstrate every operational challenge of a full
police-data release system, but it does show three constraints that
remain central in practice. First, spatial masking does not affect every
criminological analysis in the same way: in Part I the
logdistance and prior_crime_within_1km_any
terms are more sensitive to point displacement than several ward-level
opportunity covariates, so masking has to be validated against the
target analysis rather than chosen abstractly. Second, linked files
remain usable only when pseudonymization is deterministic, because joins
fail if key use or identifier formatting varies across extracts. Third,
even after direct identifiers are removed, linked researcher files can
still contain rare quasi-identifier combinations, which is why
suppression and coarsening remain necessary before release.
The relevance of my background to this project lies less in claiming a complete production solution and more in bringing together the parts already demonstrated here: spatial criminological analysis, reproducible implementation, and structured disclosure-control thinking.
Spatial criminology, scale, and model sensitivity: Work on crime location choice and related spatial criminological questions is directly relevant to the first constraint above. It also includes attention to how changing spatial scale affects criminological interpretation. That matters because anonymization decisions should be tied to analytical consequences, not only to abstract privacy principles.
Linked-data handling and pseudonymization logic: Experience with crime data and familiarity with police-style file structures are relevant to the second constraint. A linked release is useful only when identifiers are handled consistently across files and over time. The pseudonymization section of this report was built to demonstrate exactly that point: the technical task is not only to hide names, but also to preserve the joins researchers actually need.
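A minimal Python sketch of why consistent handling matters for joins follows. It assumes HMAC-SHA256 as the keyed construction, which is one common choice and not necessarily the one used in the report. The key, the helper names normalize_id and pseudonymize, and the RRN-like strings are hypothetical and fully synthetic.

```python
import hashlib
import hmac

def normalize_id(raw: str) -> str:
    """Strip punctuation and whitespace and uppercase, so formatting variants agree."""
    return "".join(ch for ch in raw if ch.isalnum()).upper()

def pseudonymize(raw: str, key: bytes, length: int = 16) -> str:
    """Keyed, deterministic pseudonym: same key + same normalized input -> same output."""
    digest = hmac.new(key, normalize_id(raw).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]

# Hypothetical demo key; a real key would never appear in source code.
KEY = b"demo-key-not-for-production"

# The same synthetic identifier, formatted differently in two extracts.
offender_file_id = "85.07.30-033.61"
incident_file_id = "85073003361"

p_offender = pseudonymize(offender_file_id, KEY)
p_incident = pseudonymize(incident_file_id, KEY)

# With a shared key and shared normalization, the join key survives.
print(p_offender == p_incident)

# With a different key in one extract, the join silently finds no match.
p_other_key = pseudonymize(incident_file_id, b"another-demo-key")
print(p_offender == p_other_key)
```

Truncating the digest to 16 hexadecimal characters is an illustrative choice that shortens the join key at the cost of a higher collision probability; a production design would need to set that length deliberately.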
Reproducible implementation and risk auditing: Work
in R, sf, simulation, and reproducible workflows is
relevant to the third constraint because quasi-identifier risk is not
something that should be checked informally. It has to be implemented,
measured, documented, and rerun when release conditions change. That is
the part of the project where statistical reasoning and implementation
work meet most clearly.
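As an illustration of what a rerunnable risk check can look like, the following Python sketch counts equivalence classes over a set of quasi-identifiers and flags records whose class is smaller than k. It is a simplified stand-in, not the report's implementation. The column names (age_band, sex, ward), the records, and the threshold k = 3 are hypothetical.

```python
from collections import Counter

def k_anonymity_report(records, quasi_ids, k):
    """Return the smallest equivalence-class size and the indices of records below k."""
    keys = [tuple(rec[q] for q in quasi_ids) for rec in records]
    counts = Counter(keys)
    risky = [i for i, key in enumerate(keys) if counts[key] < k]
    return min(counts.values()), risky

# Fully synthetic example records.
records = [
    {"age_band": "20-24", "sex": "M", "ward": "W01"},
    {"age_band": "20-24", "sex": "M", "ward": "W01"},
    {"age_band": "20-24", "sex": "M", "ward": "W01"},
    {"age_band": "25-29", "sex": "M", "ward": "W01"},
    {"age_band": "60-64", "sex": "F", "ward": "W02"},
    {"age_band": "65-69", "sex": "F", "ward": "W02"},
]
QUASI_IDS = ["age_band", "sex", "ward"]
K = 3

min_class, risky = k_anonymity_report(records, QUASI_IDS, K)
print(min_class, risky)

# Simplest remedy: suppress the risky records, then rerun the same check.
released = [rec for i, rec in enumerate(records) if i not in set(risky)]
min_after, risky_after = k_anonymity_report(released, QUASI_IDS, K)
print(min_after, risky_after)
```

Suppression is shown only because it is the shortest remedy to write down. Coarsening a quasi-identifier (for example, wider age bands) and rerunning the same function is the alternative when suppression would remove too many records.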
Security note on key mapping table: A mapping
between real identifiers (RRN, PV numbers) and their pseudonyms has been
written to outputs/Part_II_Pseudonymization/secure_vault/.
This mapping is for police system use only. It must never be shared with
researchers or stored in the research environment. In production, this
table would be held in the data controller's hardware security module
(HSM) or certified key vault, in line with the security-of-processing
requirements of GDPR Art. 32.
All names, RRN numbers, and case details in this section are fully synthetic. No real persons are represented.
Academic sources
Bernasco, W., & Nieuwbeerta, P. (2005). How do residential burglars select target areas? A new approach to the analysis of criminal location choice. British Journal of Criminology, 45(3), 296-315. https://doi.org/10.1093/bjc/azh070
European Union Agency for Cybersecurity. (2019). Pseudonymisation techniques and best practices: Recommendations on shaping technology according to data protection and privacy provisions. Publications Office of the European Union.
Fotheringham, A. S., & Wong, D. W. S. (1991). The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A, 23(7), 1025-1044. https://doi.org/10.1068/a231025
Kuralarasan, K., & Bernasco, W. (2022). Location choice of snatching offenders in Chennai City. Journal of Quantitative Criminology, 38, 673-696. https://doi.org/10.1007/s10940-021-09514-9
Openshaw, S. (1984). The modifiable areal unit problem. Geo Books.
Song, G., Bernasco, W., Liu, L., Xiao, L., Zhou, S., & Liao, W. (2019). Crime feeds on legal activities: Daily mobility flows help to explain thieves' target location choices. Journal of Quantitative Criminology, 35(4), 831-854. https://doi.org/10.1007/s10940-019-09406-z
Steenbeek, W., & Weisburd, D. (2016). Where the action is in crime? An examination of variability of crime across different spatial units in The Hague, 2001-2009. Journal of Quantitative Criminology, 32(3), 449-469. https://doi.org/10.1007/s10940-015-9276-3
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570. https://doi.org/10.1142/S0218488502001648
Vandeviver, C., Van Daele, S., & Vander Beken, T. (2015). What makes long crime trips worth undertaking? Balancing costs and benefits in burglars' journey to crime decisions. British Journal of Criminology, 55(2), 399-420. https://doi.org/10.1093/bjc/azu093
Weisburd, D., Groff, E. R., & Yang, S.-M. (2012). The criminology of place: Street segments and our understanding of the crime problem. Oxford University Press.
Legal sources
Belgian Federal Government. (1992). Wet op het Politieambt [Belgian Police Act]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/
Belgian Federal Government. (2018). Wet van 30 juli 2018 betreffende de bescherming van natuurlijke personen met betrekking tot de verwerking van persoonsgegevens [Law of 30 July 2018 on the protection of natural persons with regard to the processing of personal data]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/
European Parliament and Council of the European Union. (2016a). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88. https://eur-lex.europa.eu/eli/reg/2016/679/oj
European Parliament and Council of the European Union. (2016b). Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data. Official Journal of the European Union, L 119, 89-131. https://eur-lex.europa.eu/eli/dir/2016/680/oj