Scope of the report
Police-derived crime data always carries two disclosure problems at once: sensitive locations that reveal where crimes happened and where offenders live, and identifiers — PV numbers, RRN, person attributes — that allow the same individual to be traced across files and datasets. In practice, any anonymization pipeline must handle both simultaneously: masking locations without destroying the analytical model, and pseudonymizing identifiers without breaking the cross-file joins that make linked-data research possible.
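This document demonstrates that both problems can be addressed within a reproducible workflow: Part I evaluates spatial masking of crime and home locations, and Part II demonstrates consistent keyed pseudonymization of identifiers.
Together they form the analytical and technical foundation of a pipeline that is both privacy-protective and analytically valid.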
Part I applies six spatial masking methods, three grid aggregation levels (250 m, 500 m, 1000 m) and three geomasking radius ranges (50-300 m, 200-600 m, 400-1200 m), to a real street robbery dataset from Chennai, India. Each method shifts or aggregates crime incident points and offender home locations to reduce location disclosure risk. The masked datasets are then used to re-estimate the crime location choice model, and the results are compared against the unmasked baseline. The central question is how much the model coefficients change: a small deviation means the masked data can still support the same analytical conclusions, while a large deviation means the transformation has distorted the results beyond practical use.
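To make the two families of masking methods concrete, the following is a minimal sketch rather than the report's actual implementation. It assumes projected coordinates in metres, assumes that grid aggregation snaps each point to the centre of its cell, and assumes that geomasking draws the displacement distance uniformly between the inner and outer radius. The function names and the example point are hypothetical.

```python
import math
import random

def grid_mask(x, y, cell_size):
    """Snap a point to the centre of the square grid cell that contains it."""
    cx = (math.floor(x / cell_size) + 0.5) * cell_size
    cy = (math.floor(y / cell_size) + 0.5) * cell_size
    return cx, cy

def donut_geomask(x, y, r_min, r_max, rng):
    """Move a point a random distance in [r_min, r_max] in a random direction."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # Assumption: distance is uniform in radius, not uniform over the ring's area.
    distance = rng.uniform(r_min, r_max)
    return x + distance * math.cos(angle), y + distance * math.sin(angle)

rng = random.Random(42)            # fixed seed so the masking is reproducible
point = (1234.0, 5678.0)           # hypothetical projected coordinates (metres)

grid_point = grid_mask(*point, cell_size=250)
geo_point = donut_geomask(*point, r_min=50, r_max=300, rng=rng)
shift = math.dist(point, geo_point)  # displacement introduced by geomasking
print(grid_point, round(shift, 1))
```

The inner radius is what distinguishes this from plain random perturbation: no masked point can land closer than `r_min` to its true position.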
The baseline follows the Model 2 specification from the published Chennai snatching study (Kuralarasan & Bernasco, 2022, Table 3), with one adaptation: the same-ward prior-crime indicator is replaced by a boundary-free 1 km familiarity indicator. This change makes the model less sensitive to arbitrary administrative boundaries once crime and home points are spatially masked. Table 1.1 below ranks the tested methods by overall deviation from the unmasked result.
This comparison is also relevant from the perspective of spatial units of analysis. A substantial literature on the modifiable areal unit problem shows that model coefficients can change when phenomena are aggregated into different spatial units, especially when administrative boundaries do not align well with the behavioral processes under study (Fotheringham & Wong, 1991; Openshaw, 1984). In research on crime and place specifically, variation across spatial units is often large enough to alter interpretation (Steenbeek & Weisburd, 2016; Weisburd et al., 2012). For that reason, spatial masking should be evaluated not only as a privacy intervention but also as a change in the spatial representation through which offender decision-making is measured.
The prior-crime familiarity indicator is especially sensitive to administrative boundary dependence. A same-ward indicator is administratively convenient, but it registers a boundary crossing as a meaningful change in offender familiarity even when the masked point remains geographically close to its original position. A boundary-free 1 km indicator is therefore more robust to small positional shifts and more consistent with arguments in spatial criminology that offenders' awareness spaces and relevant opportunity structures do not necessarily coincide with administrative borders (Bernasco & Nieuwbeerta, 2005; Song et al., 2019).
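The contrast between the two familiarity indicators can be illustrated with a short sketch. It assumes projected coordinates in metres and assumes that each candidate location is represented by a single reference point. The function names, ward labels, and coordinates are hypothetical and are not taken from the Chennai data.

```python
import math

def any_prior_crime_within(candidate_xy, prior_crime_xys, radius_m=1000.0):
    """Return 1 if any prior crime lies within radius_m of the candidate point, else 0."""
    return int(any(math.dist(candidate_xy, p) <= radius_m for p in prior_crime_xys))

def same_ward_prior_crime(candidate_ward, prior_crime_wards):
    """Return 1 if any prior crime was recorded in the candidate's ward, else 0."""
    return int(candidate_ward in prior_crime_wards)

# Hypothetical case: the prior crime is 400 m away but just across a ward boundary.
candidate = (1000.0, 1000.0)
prior_crimes = [(1400.0, 1000.0)]

distance_based = any_prior_crime_within(candidate, prior_crimes)
ward_based = same_ward_prior_crime("W12", {"W13"})
print(distance_based, ward_based)
```

In this constructed case the distance-based indicator registers familiarity while the ward-based one does not, which is the boundary effect the 1 km specification is meant to avoid.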
Method | RMSE | Max abs. bias | Most affected covariate | Crime point in same ward (%) | Home point in same ward (%) |
|---|---|---|---|---|---|
Grid 250m | 0.017 | 0.038 | Marriage halls (per 10) | 86.5 | 87.2 |
Geomask 50-300m | 0.028 | 0.075 | Mosques (per 10) | 79.0 | 79.3 |
Grid 500m | 0.040 | 0.124 | Any prior crime within 1 km (0,1) | 78.7 | 79.1 |
Geomask 200-600m | 0.084 | 0.336 | Any prior crime within 1 km (0,1) | 59.0 | 59.2 |
Grid 1000m | 0.091 | 0.224 | Mosques (per 10) | 58.1 | 62.6 |
Geomask 400-1200m | 0.114 | 0.496 | Any prior crime within 1 km (0,1) | 34.5 | 32.0 |
Table 1.1 is a joint summary of overall model drift and spatial reassignment stability. RMSE reports the root mean square error of coefficient deviation from the unmasked model, while Max abs. bias shows the single largest coefficient shift within each masking method. The Most affected covariate column identifies which coefficient is most distorted, and the two ward-stability columns show how often masked crime points and home points remain in their original wards after masking. Together, these columns reveal not only which method ranks best overall, but also why some methods degrade more quickly than others.
The best-performing masking method in this analysis is Grid 250m. It has the lowest RMSE (0.017), the smallest maximum single-term bias (0.038), and the highest ward stability for both crime points and home points (about 86%–87%). The second-ranked method, Geomask 50-300m, remains relatively close to the baseline, but already shows lower ward stability (about 79%) and a larger maximum bias (0.075). The next masking level, Grid 500m, still preserves the broad substantive pattern, but it is the first method where the boundary-free familiarity term becomes the worst-affected coefficient, indicating that the model starts to register the effects of more frequent spatial reassignment. The wider masking settings show the same pattern more clearly: once ward stability drops to roughly 60% and below, coefficient distortion increases substantially (RMSE of 0.084 and above) even though the general direction of the main findings remains recognizable.
Table 1.2 compares the unmasked baseline directly against the best-performing masking method. It reports odds ratios, p-values, and the percentage change in the odds ratio. This is the most direct way to judge whether masking changes the substantive interpretation of the Chennai model under the boundary-free familiarity specification.
Term | Baseline OR | Baseline p | Masked OR | Masked p | OR diff. | % OR change |
|---|---|---|---|---|---|---|
Distance (km) | 0.356 | 0.000 | 0.350 | 0.000 | -0.007 | -1.832 |
Any prior crime within 1 km (0,1) | 10.018 | 0.000 | 9.741 | 0.000 | -0.277 | -2.767 |
Area (km2) | 1.075 | 0.000 | 1.092 | 0.000 | 0.017 | 1.576 |
Population (per 1,000) | 1.004 | 0.208 | 1.003 | 0.403 | -0.001 | -0.132 |
Retail stores (per 10) | 1.010 | 0.358 | 1.013 | 0.261 | 0.002 | 0.228 |
Transit stations (per 10) | 1.039 | 0.432 | 1.031 | 0.524 | -0.008 | -0.727 |
Mosques (per 10) | 1.073 | 0.450 | 1.093 | 0.341 | 0.020 | 1.859 |
Temples (per 10) | 1.009 | 0.760 | 0.980 | 0.494 | -0.029 | -2.916 |
Churches (per 10) | 1.182 | 0.000 | 1.161 | 0.000 | -0.021 | -1.757 |
Education institutions (per 10) | 1.072 | 0.000 | 1.066 | 0.000 | -0.005 | -0.507 |
School and college (per 10) | 0.989 | 0.669 | 0.994 | 0.814 | 0.005 | 0.495 |
Personal care (per 10) | 1.066 | 0.003 | 1.052 | 0.021 | -0.015 | -1.379 |
Hospitals (per 10) | 0.999 | 0.950 | 1.000 | 0.994 | 0.001 | 0.113 |
Marriage halls (per 10) | 1.123 | 0.004 | 1.168 | 0.000 | 0.044 | 3.921 |
Jewelleries (per 10) | 1.025 | 0.163 | 1.035 | 0.045 | 0.010 | 1.003 |
Textiles (per 10) | 0.987 | 0.128 | 0.984 | 0.058 | -0.003 | -0.331 |
Park (per 10) | 1.176 | 0.017 | 1.164 | 0.024 | -0.012 | -1.018 |
Recreation facilities (per 10) | 0.957 | 0.124 | 0.984 | 0.556 | 0.026 | 2.743 |
Restaurant (per 10) | 1.028 | 0.033 | 1.035 | 0.009 | 0.006 | 0.612 |
Government office (per 10) | 1.058 | 0.013 | 1.062 | 0.007 | 0.004 | 0.375 |
The main findings remain stable under the best masking method. The distance effect remains negative and strong, and the boundary-free 1 km familiarity effect remains strongly positive. The only term whose p-value crosses the 0.05 threshold is Jewelleries (0.163 in the baseline, 0.045 after masking), and its odds ratio changes by about 1%. In the HTML version, the masked p-value column is colour-coded by significance stability relative to the baseline, while the absolute difference and percentage change in the odds ratio are shaded by the size of the deviation. The core criminological interpretation therefore does not reverse under the best-performing masking specification.
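The summary metrics in Table 1.1 can be approximately recomputed from the odds ratios in Table 1.2. The sketch below assumes that coefficient deviation is measured on the log-odds scale across all 20 terms; the report does not state this explicitly. Because the inputs are the rounded odds ratios from the table, the results should land close to, but not exactly on, the reported RMSE of 0.017 and maximum bias of 0.038 for Grid 250m.

```python
import math

# (term, baseline OR, masked OR) copied from Table 1.2, rounded to 3 decimals
odds_ratios = [
    ("Distance (km)", 0.356, 0.350),
    ("Any prior crime within 1 km (0,1)", 10.018, 9.741),
    ("Area (km2)", 1.075, 1.092),
    ("Population (per 1,000)", 1.004, 1.003),
    ("Retail stores (per 10)", 1.010, 1.013),
    ("Transit stations (per 10)", 1.039, 1.031),
    ("Mosques (per 10)", 1.073, 1.093),
    ("Temples (per 10)", 1.009, 0.980),
    ("Churches (per 10)", 1.182, 1.161),
    ("Education institutions (per 10)", 1.072, 1.066),
    ("School and college (per 10)", 0.989, 0.994),
    ("Personal care (per 10)", 1.066, 1.052),
    ("Hospitals (per 10)", 0.999, 1.000),
    ("Marriage halls (per 10)", 1.123, 1.168),
    ("Jewelleries (per 10)", 1.025, 1.035),
    ("Textiles (per 10)", 0.987, 0.984),
    ("Park (per 10)", 1.176, 1.164),
    ("Recreation facilities (per 10)", 0.957, 0.984),
    ("Restaurant (per 10)", 1.028, 1.035),
    ("Government office (per 10)", 1.058, 1.062),
]

# Deviation of each coefficient (log odds ratio) from the unmasked baseline
log_diffs = {term: math.log(masked) - math.log(base) for term, base, masked in odds_ratios}

rmse = math.sqrt(sum(d * d for d in log_diffs.values()) / len(log_diffs))
worst_term = max(log_diffs, key=lambda term: abs(log_diffs[term]))
max_abs_bias = abs(log_diffs[worst_term])
print(round(rmse, 3), worst_term, round(max_abs_bias, 3))
```

The most affected term under this calculation is Marriage halls, which matches the Most affected covariate column for Grid 250m in Table 1.1.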
The two figures below complement Tables 1.1 and 1.2 by showing the same masking results at two different levels of summary. Figure 1.1 is coefficient-specific: it shows how each odds ratio moves across masking methods relative to the unmasked baseline, marked by the dashed line in each panel. This makes it possible to see which terms remain tightly clustered across methods and which terms drift more clearly as masking becomes stronger. Figure 1.2 then collapses that information into a single method-level ranking using RMSE, so it should be interpreted as a compact summary of overall deviation rather than as a substitute for the term-by-term comparison in Table 1.2.
Figure 1.1: Odds ratios by masking method
Figure 1.2: RMSE ranking of masking methods
Taken together, the figures reinforce the pattern already visible in the tables. The smaller masking settings remain much closer to the unmasked baseline, while the wider geomasking and coarser grid methods introduce visibly larger coefficient movement. The coefficient plot also shows that distortion is not evenly distributed across terms: some ward-level opportunity covariates remain comparatively stable, whereas the familiarity term and a smaller number of place-based covariates become more sensitive as spatial reassignment becomes more common. The RMSE ranking in Figure 1.2 is therefore best interpreted as a summary of a broader pattern already visible in Figure 1.1, not as an isolated performance score.
This section reports a reduced model excluding the prior-location covariate. The main Chennai specification retains a boundary-free prior-familiarity covariate based on whether the offender had previously offended within 1 km of the candidate location, because that measure is less sensitive to arbitrary boundary crossings than a same-ward indicator. A reasonable concern, however, is that the overall masking results might partly depend on that modeling choice. To address that concern, I also estimated a reduced specification that drops the prior-location covariate entirely and re-runs the same masking comparison on the reduced model.
This reduced model is not treated as the preferred substantive specification. It removes an important mechanism of spatial familiarity and therefore answers a more limited question: if that mechanism is omitted altogether, do the masking results still show the same broad ranking pattern? Read in that way, the reduced model is a robustness check on the masking comparison, not a replacement for the main model. Table 1.3 presents the results.
Robustness specification | Best masking method | Best RMSE | Crime point stays in same ward (%) | Home point stays in same ward (%) |
|---|---|---|---|---|
Reduced model (drop prior-crime term) | Geomask 50-300m | 0.015 | 79.8 | 80 |
In that reduced model, the best masking method was Geomask 50-300m with an RMSE of about 0.015, slightly lower than the best RMSE in the main 1 km familiarity model (0.017). That lower RMSE should not be over-interpreted as evidence that the reduced model is substantively better. With one fewer behaviorally important covariate to preserve, the reduced specification is simply easier to reproduce after masking. The important point is that the smaller masking settings still perform best, while the wider masking settings still introduce visibly more distortion. In other words, the broad masking pattern does not disappear when the prior-location covariate is omitted, but the main specification remains preferable because it preserves a more meaningful behavioural mechanism.
Part I is a baseline-deviation study: it measures how far the masked model moves away from the original fitted result, rather than attempting to recover true population parameters. The main result is that smaller masking settings preserve the original model coefficients most effectively: Grid 250m performs best, Geomask 50-300m remains close, and wider masking settings introduce progressively larger distortion. The reduced-model robustness check supports the same broad conclusion: omitting the prior-location covariate changes which method ranks first (Geomask 50-300m instead of Grid 250m), but it does not overturn the overall ranking pattern in which smaller masking settings are analytically safer. The practical implication is therefore straightforward: spatial masking can remain compatible with the original analytical conclusion, but only within a relatively limited range of spatial displacement or aggregation.
Belgian police data spans roughly 196 local police zones and multiple institutional systems — local operational registers (ISLP/ISLP2), national reference systems (ANG/BNG, FEEDIS), justice environments (JustCase, JustMask), prison systems (Sidis Suite), and forensic research infrastructures (DOT, be.care). The same person, case, or event can therefore appear across these separate systems under different local formatting conventions. Without an algorithmic approach, zones anonymize data inconsistently: one zone may redact birth year while another retains it; one may suppress nationality while another codes it differently. Aggregated research then carries systematic measurement error invisible to the researcher, and cross-dataset joins on PV numbers or person identifiers break silently.
Removing names is not sufficient to address this. PV numbers, the national registry number (Rijksregisternummer/RRN), and date of birth together can still support re-identification, and replacing them with a fresh random ID each time destroys the cross-file joins that make linked-data research possible. This section demonstrates that consistent keyed pseudonymization solves both problems simultaneously: applying HMAC(key, RRN) deterministically means the same person always receives the same pseudonym regardless of which zone or system extracted the record, so direct identifiers are removed and cross-dataset linkage is preserved. Linking a pseudonym back to a person is only possible with the secret key, which is held by the data controller.
The three synthetic datasets used below — crime incidents, offender records, and victim records — were generated to resemble Belgian police/crime data. All fields, identifiers, and record structures follow Belgian police data conventions, but no real personal data are used.
Table 2.1 shows the structure of the synthetic incident register before pseudonymization. It represents recorded criminal incidents registered through PV-based workflows in a Belgian police data structure. The important analytical fields are the PV number, date, offence type, neighbourhood, and injury/outcome variables.
PV Number | Date | Offence Type | Neighbourhood | Injury |
|---|---|---|---|---|
2022/GNT/05041 | 2023-08-04 | Property crime | Wondelgem | Geen |
2024/GNT/03266 | 2022-02-18 | Public order / substance | Sint-Amandsberg | Geen |
2022/GNT/05326 | 2022-03-12 | Property crime | Mariakerke | Geen |
2022/GNT/09504 | 2024-07-07 | Public order / substance | Mariakerke | Geen |
Taken together with the person-level files below, this incident register provides the case-level anchor for later linkage checks.
Some offenders committed multiple crimes: repeat offenders appear with the same person_id in multiple rows.
Table 2.2 is a person-event file: the same person can recur across multiple PV records, making linkage preservation a central requirement of the anonymization pipeline.
Person ID | PV Number | Gender | Nationality | Role |
|---|---|---|---|---|
D0057 | 2021/GNT/03774 | V | Duits | Verdachte |
D0044 | 2022/GNT/03783 | M | Congolees | Verdachte |
D0028 | 2024/GNT/03266 | M | Roemeens | Verdachte |
D0041 | 2024/GNT/09571 | V | Nederlands | Verdachte |
Some victims appear in multiple incidents. Critically, some persons are both offender and victim in different cases, a real pattern in interpersonal violence data.
Table 2.3 mirrors the offender file from the victim side, showing that the eventual release logic has to preserve not only offender-to-incident joins, but also cross-role person linkage across files.
Person ID | PV Number | Gender | Nationality | Relation to Suspect |
|---|---|---|---|---|
S0016 | 2024/GNT/05463 | M | Duits | Kennis |
S0037 | 2022/GNT/06644 | M | Pools | Onbekend |
S0061 | 2023/GNT/08088 | M | Frans | Collega |
S0066 | 2024/GNT/03176 | M | Duits | Familielid |
A naive approach strips direct identifiers (name, RRN, address) and replaces each identifier with a new random ID every time. This is the most common mistake in ad-hoc anonymization.
Why this fails: Random replacement generates a different code for 2022/GNT/00341 in the offender file and a completely different code in the victim file. Researchers cannot join on PV number, so the data becomes useless for cross-file analysis.
The root cause is non-determinism: because each call to sample() is independent, the same input value produces a different output in each file. The fix is not to add more redaction. It is to replace randomness with a deterministic function: one that always maps the same input to the same output, using a secret key that only the data controller holds. Table 2.2b quantifies this failure using actual join counts.
Join type | Raw data | After naive anonymization | Result |
|---|---|---|---|
Offender → Victim (shared PV number) | 54 shared PV numbers | 0 matched rows | ✗ Complete linkage failure |
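The failure summarised in Table 2.2b can be reproduced in a few lines. This is a minimal sketch with hypothetical PV numbers, not the report's own code: it uses Python's `secrets.token_hex` as a stand-in for any per-call random ID generator.

```python
import secrets

def naive_anonymize(pv_numbers):
    """Replace every PV number with a fresh random code, independently on each call."""
    return [f"ID-{secrets.token_hex(8)}" for _ in pv_numbers]

# Hypothetical PV numbers; two of them appear in both files.
offender_pvs = ["2022/GNT/00341", "2023/GNT/01022", "2024/GNT/00007"]
victim_pvs = ["2022/GNT/00341", "2023/GNT/01022", "2021/GNT/00999"]

raw_shared = set(offender_pvs) & set(victim_pvs)
naive_shared = set(naive_anonymize(offender_pvs)) & set(naive_anonymize(victim_pvs))

# Shared PV numbers before and after naive replacement
print(len(raw_shared), len(naive_shared))
```

The two shared cases in the raw files have no counterpart after replacement, because nothing ties the code issued in one file to the code issued in the other.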
I propose to use HMAC-SHA256, which can be understood here as a standard keyed hashing method. It turns personal identifiers such as an RRN or PV number into a stable pseudonym using a secret key applied consistently across all files (see §2.2 for why this determinism is the core requirement). HMAC-SHA256 produces a 256-bit digest (64 hexadecimal characters); the pseudonyms shown in this report are 24-character hexadecimal tokens, which implies that the digest is shortened. The token is collision-resistant for datasets of this size and cannot be inverted or recomputed without the key.
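A keyed pseudonym of this kind can be sketched with Python's standard library. The key value, the function name, and the prefix handling below are hypothetical. Truncating the digest to its first 24 hexadecimal characters is an assumption inferred from the token length shown in the tables, not a confirmed detail of the pipeline.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key is long, random, and held by the data controller.
SECRET_KEY = b"replace-with-a-long-random-key"

def pseudonymize(identifier: str, prefix: str, key: bytes = SECRET_KEY, length: int = 24) -> str:
    """HMAC-SHA256 of the identifier, truncated to `length` hex characters and prefixed."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:length].upper()}"

# The same input and key give the same pseudonym, whichever file the record comes from.
pv_in_offender_file = pseudonymize("2022/GNT/00341", "PV")
pv_in_victim_file = pseudonymize("2022/GNT/00341", "PV")

# A different key gives an unrelated pseudonym for the same input.
pv_other_key = pseudonymize("2022/GNT/00341", "PV", key=b"a-different-key")

print(pv_in_offender_file == pv_in_victim_file, pv_in_offender_file == pv_other_key)
```

HMAC treats differently formatted versions of the same identifier as different inputs, so identifiers would need to be normalised to one canonical format before hashing if source systems format them differently.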
A terminological point is important here. Under GDPR Recital 26 and Art. 4(5) (as reiterated in EDPB Guidelines 01/2025 on Pseudonymisation), data is pseudonymized, not anonymized, as long as a key exists that could link the pseudonym back to the original identifier. This pipeline is therefore correctly described as a privacy-preserving transformation, not full anonymization: it replaces direct identifiers with stable pseudonyms while keeping the key under the exclusive control of the data controller. Full anonymization would require, at a minimum, that the key be permanently destroyed, which would also permanently destroy the ability to audit or correct the released data. The approach here is deliberately pseudonymization.
ENISA discusses keyed-hash / HMAC-style approaches as valid pseudonymization techniques (ENISA, 2019), and this implementation is compatible with GDPR Art. 4(5).
The same HMAC key is applied to all three datasets. Each direct identifier (RRN, PV number, name, address) is either pseudonymized or removed. The researcher-facing files shown below retain analytical fields and stable pseudonyms, but remove direct identifiers such as names, RRN, and raw PV numbers.
Table 2.4: Pseudonymized offender file
PV Pseudonym | Person Pseudonym | Gender | Nationality | Role |
|---|---|---|---|---|
PV-841A5F319D8B609FB41C7E8C | D-F23B9BA25B6EC33EF0F3BD2C | V | Duits | Verdachte |
PV-85A517390E8F41928753A119 | D-523BA24A18253BDFC7A8C87D | M | Congolees | Verdachte |
PV-7E79B32DE9FE421A4CA45E08 | D-1895C0DB1C22DB37728E97CE | M | Roemeens | Verdachte |
PV-01FB2838F77F4CBDEAD4396E | D-80C0220A9292AD1AE3170D97 | V | Nederlands | Verdachte |
PV-CE1B202E69F37EAF63F9FD98 | D-AC4CC429F7F3D5489E14522B | V | Belgisch | Verdachte |
Table 2.5: Pseudonymized victim file
PV Pseudonym | Person Pseudonym | Gender | Nationality | Role |
|---|---|---|---|---|
PV-BDD76BBF3F1C5F8909C5DD9D | S-F71BABA98A2E1DA1A8D494C6 | M | Duits | Slachtoffer |
PV-C19BF58E43D7145B7155580B | S-ACF088B1C241F2BD6C561B79 | M | Pools | Slachtoffer |
PV-661E861336D7E15104CD9F69 | S-AB252BC3C6E5F0B6B00863D2 | M | Frans | Slachtoffer |
PV-B87EB997C7140CFAA42EE97A | S-7DF963006DA72F97BF4CF224 | M | Duits | Slachtoffer |
PV-7053CD96B4018C7075F2DC80 | S-3D6FEB2A6CA2AB892563603F | M | Duits | Slachtoffer |
Tables 2.4 and 2.5 should be interpreted as structure checks: they show that the research files still contain the fields needed for analysis and linkage, but no longer expose direct personal identifiers.
The core claim of consistent keyed pseudonymization is that all analytical joins work identically before and after pseudonymization. This matters not only for longitudinal or cross-file analysis, but also for network analysis and intelligence-led or forensic insight, because stable person- and case-level pseudonyms allow co-involvement, repeat contacts, and event relationships to be reconstructed without exposing direct identifiers.
Table 2.5b summarises this as a scorecard across three join types demonstrated in Sections 2.4.1–2.4.3:
Join type | Raw data | After pseudonymization | Preserved? |
|---|---|---|---|
Incidents raw count | 120 | 120 | 100% (ok) |
Unique PV numbers (incidents) | 120 | 120 | 100% (ok) |
Unique offender persons (by RRN/pseudonym) | 42 | 42 | 100% (ok) |
Offender → Incident joins (PV number) | 85 | 85 | 100% (ok) |
Offender → Victim joins (same person) | 4 | 4 | 100% (ok) |
The scorecard should be read by comparing each join count before and after pseudonymization. The Raw data column uses real identifiers such as RRN and PV number, whereas the After pseudonymization column uses only the derived pseudonyms. Identical counts show that the intended joins are preserved after direct identifiers are transformed. That result demonstrates linkage integrity, but it does not imply that the release is risk-free: quasi-identifier risk remains and is evaluated in the next sections.
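A before/after scorecard of this kind can be sketched as follows. The rows, the key, and the helper names are hypothetical, and the pseudonym function is a compact stand-in for whatever keyed function the pipeline uses. The point is only that each count is computed twice, once on raw identifiers and once on pseudonyms, and then compared.

```python
import hashlib
import hmac

KEY = b"hypothetical-demo-key"

def pseudo(value: str) -> str:
    """Compact keyed pseudonym (HMAC-SHA256, first 24 hex characters)."""
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:24]

def count_joins(left, right):
    """Count left rows whose first field (the join key) also occurs in the right table."""
    right_keys = {row[0] for row in right}
    return sum(1 for row in left if row[0] in right_keys)

# Hypothetical rows: offenders are (PV number, RRN), incidents are (PV number,)
offenders = [
    ("2022/GNT/00341", "86.04.12-234.71"),
    ("2023/GNT/01022", "90.04.04-123.45"),
    ("2024/GNT/00007", "86.04.12-234.71"),  # repeat offender, no matching incident row
]
incidents = [("2022/GNT/00341",), ("2023/GNT/01022",), ("2021/GNT/00999",)]

offenders_p = [(pseudo(pv), pseudo(rrn)) for pv, rrn in offenders]
incidents_p = [(pseudo(pv),) for (pv,) in incidents]

# Each entry holds (count on raw identifiers, count on pseudonyms)
scorecard = {
    "offender_incident_joins": (count_joins(offenders, incidents),
                                count_joins(offenders_p, incidents_p)),
    "unique_offender_persons": (len({rrn for _, rrn in offenders}),
                                len({rrn for _, rrn in offenders_p})),
}
preserved = all(raw == masked for raw, masked in scorecard.values())
print(scorecard, preserved)
```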
Table 2.6 shows the simplest preserved join in the release pipeline: offender attributes remain linkable to incident-level case information through the shared PV pseudonym. Two key column labels require clarification: the PV pseudonym is the pseudonymized case identifier (derived from the original PV number and stable across all files); the Person pseudonym is the pseudonymized offender identifier (derived from the RRN via HMAC-SHA256). Neither field contains any direct personal information — they serve solely as stable, researcher-safe linkage keys.
PV pseudonym | Person pseudonym | Gender | Age group | Nationality | Prior PVs | Known to police | Date | Offence type | Neighbourhood | Injury |
|---|---|---|---|---|---|---|---|---|---|---|
PV-841A5F319D8B609FB41C7E8C | PRS-F23B9BA25B6EC33EF0F3BD2C | V | | Duits | 0 | Nee | 2021-07-01 | Property crime | Muide | Zwaar |
PV-85A517390E8F41928753A119 | PRS-523BA24A18253BDFC7A8C87D | M | | Congolees | 1 | Ja | 2022-07-28 | Public order / substance | Sint-Amandsberg | Geen |
PV-7E79B32DE9FE421A4CA45E08 | PRS-1895C0DB1C22DB37728E97CE | M | | Roemeens | 0 | Nee | 2022-02-18 | Public order / substance | Sint-Amandsberg | Geen |
PV-01FB2838F77F4CBDEAD4396E | PRS-80C0220A9292AD1AE3170D97 | V | | Nederlands | 0 | Nee | 2023-11-21 | Public order / substance | Bloemekenswijk | Licht |
PV-CE1B202E69F37EAF63F9FD98 | PRS-AC4CC429F7F3D5489E14522B | V | | Belgisch | 1 | Ja | 2021-08-31 | Property crime | Muide | Geen |
PV-B87EB997C7140CFAA42EE97A | PRS-4700C7A200C8AB703BE079FA | M | | Congolees | 1 | Nee | 2022-12-23 | Public order / substance | Sint-Amandsberg | Licht |
As noted in §2.1.3, some persons appear in both the offender and victim registers across different cases — the cross-role pattern characteristic of interpersonal violence data. Table 2.7 confirms that the universal person pseudonym preserves this linkage after pseudonymization: researchers can trace the same individual across roles without ever knowing who that person is.
Person pseudonym | Incident PV No. (offender role) | Role (offender) | Incident PV No. (victim role) | Role (victim) | Relation to suspect |
|---|---|---|---|---|---|
PRS-64439954BA3B5F3509FF73C6 | PV-8CAE09F42E9D407D204ED698 | Verdachte | PV-ED87DE7C95D7C998D7074D5D | Slachtoffer | Onbekend |
PRS-D2811785D27F7368CAFE362B | PV-EF8DB40B581698D0B476B36E | Verdachte | PV-24464F55C28975FB5AA79987 | Slachtoffer | Onbekend |
PRS-D2811785D27F7368CAFE362B | PV-EF8DB40B581698D0B476B36E | Verdachte | PV-23FFB7E4AD3DF5B710508392 | Slachtoffer | Onbekend |
PRS-1E4DC75EB6FF9F17A9B1FE74 | PV-6DFFC4C792512AC2BEF2DDF5 | Verdachte | PV-21358D6B89AD66EE1E0C7406 | Slachtoffer | Partner |
Table 2.8 extends the same logic over time: it shows that repeat involvement across multiple PV records can still be reconstructed for one pseudonymous individual, which is essential for longitudinal offending analysis.
Person pseudonym | PV pseudonym | Date | Offence type | Neighbourhood | Gender | Age group | Nationality | Prior PVs |
|---|---|---|---|---|---|---|---|---|
PRS-12DC66B62F83CA4CE2DDE998 | PV-795D706A47D6B9B8313E718D | 2021-05-23 | Violence / interpersonal | Muide | M | | Belgisch | 0 |
PRS-12DC66B62F83CA4CE2DDE998 | PV-661E861336D7E15104CD9F69 | 2021-08-01 | Violence / interpersonal | Sint-Amandsberg | M | | Belgisch | 3 |
PRS-12DC66B62F83CA4CE2DDE998 | PV-CC570D1B8B80400F821D559F | 2021-08-02 | Violence / interpersonal | Gentbrugge | M | | Belgisch | 3 |
PRS-12DC66B62F83CA4CE2DDE998 | PV-E871AF0588D26888F08E6816 | 2023-08-19 | Violence / interpersonal | Mariakerke | M | | Belgisch | 1 |
PRS-12DC66B62F83CA4CE2DDE998 | PV-7B70B46C910CA605295A4B82 | 2023-09-02 | Property crime | Ledeberg | M | | Belgisch | 1 |
PRS-12DC66B62F83CA4CE2DDE998 | PV-A9921AD699D6DCB60A9DA459 | 2023-11-13 | Public order / substance | Gentbrugge | M | | Belgisch | 3 |
PRS-12DC66B62F83CA4CE2DDE998 | PV-9A654E2DCB93B08E64436258 | 2024-04-04 | Property crime | Gentbrugge | M | | Belgisch | 2 |
Even after direct identifiers are removed, a combination of quasi-identifiers (age group, nationality, marital status, and gender) can make an individual unique and therefore re-identifiable. Sweeney (2000) estimated that 87% of the US population is uniquely identified by ZIP code, date of birth, and sex.
Table 2.9 reports quasi-identifier uniqueness in the raw files — before any pseudonymization — to establish a risk baseline. Table 2.10 reports the same metric in the pseudonymized files. The values are expected to be identical across both tables: pseudonymization replaces direct identifiers (names, RRN, PV numbers) but leaves quasi-identifiers — age group, nationality, marital status, and gender — unchanged. The comparison therefore confirms that pseudonymization alone does not eliminate re-identification risk through quasi-identifier combinations, which is why the k-anonymity suppression step in Section 2.5.3 remains necessary.
Table 2.9: Quasi-identifier uniqueness in the raw files
Dataset | Records | Distinct combos | Unique records (n) | % unique | Small group (n) | % small group |
|---|---|---|---|---|---|---|
Offenders (raw) | 85 | 32 | 10 | 12% | 44 | 52% |
Victims (raw) | 110 | 46 | 12 | 11% | 74 | 67% |
Table 2.10: Quasi-identifier uniqueness in the pseudonymized files
Dataset | Records | Distinct combos | Unique records (n) | % unique | Small group (n) | % small group |
|---|---|---|---|---|---|---|
Offenders (pseudonymized) | 85 | 32 | 10 | 12% | 44 | 52% |
Victims (pseudonymized) | 110 | 46 | 12 | 11% | 74 | 67% |
Tables 2.9 and 2.10 should be read comparatively. The percentages are record-level shares, not shares of distinct combination types. The identical before/after figures confirm what the section introduction above explains: pseudonymization does not alter the quasi-identifiers used in the risk calculation.
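The record-level metrics in Tables 2.9 and 2.10 can be computed with a simple frequency count over quasi-identifier combinations. The sketch below assumes that a "small group" means a combination shared by fewer than k = 5 records and that unique records are counted within that total; the report does not spell out either detail. The example records are hypothetical.

```python
from collections import Counter

def uniqueness_summary(records, k=5):
    """Record-level uniqueness for a list of quasi-identifier tuples."""
    sizes = Counter(records)                       # records per combination
    n = len(records)
    unique = sum(1 for r in records if sizes[r] == 1)
    small = sum(1 for r in records if sizes[r] < k)
    return {
        "records": n,
        "distinct_combos": len(sizes),
        "unique_records": unique,
        "pct_unique": round(100 * unique / n),
        "small_group_records": small,
        "pct_small_group": round(100 * small / n),
    }

# Hypothetical records: (age group, nationality, marital status, gender)
records = (
    [("25-34", "Belgisch", "Ongehuwd", "M")] * 6
    + [("35-44", "Duits", "Gehuwd", "V")] * 2
    + [("18-24", "Pools", "Ongehuwd", "M")]
)
summary = uniqueness_summary(records)
print(summary)
```

Because the function reads only the quasi-identifier columns, running it on raw and pseudonymized files gives the same result, which is the pattern the two tables show.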
That residual uniqueness risk is why the k-anonymity check below is still needed. Figure 2.1 visualises the cell-level risk distribution by age group, making it possible to see at a glance which age groups are most exposed before suppression is applied.
Figure 2.1: k-anonymity risk by age group
k-Anonymity requires that every record in a release file shares its quasi-identifier combination with at least k-1 other records, making individual re-identification at most a 1-in-k probability (Sweeney, 2002). To reduce the privacy risk created by raw country labels while retaining interpretability, nationality is released in grouped categories rather than being suppressed immediately. The main release scheme is Belgian / EU (non-Belgian) / Non-EU, with a fallback to Belgian / Non-Belgian when the 3-group cell still falls below the chosen threshold in this small synthetic sample. Table 2.11 defines the release categories, their source label coverage, and the privacy and analytical rationale for each.
Table 2.11: Nationality release categories
Release category | Example source labels | Privacy rationale | Analytical usefulness | Recommended for small police datasets |
|---|---|---|---|---|
Belgian | Belgisch | Keeps the domestic reference category while removing country-level specificity. | Preserves the key contrast between Belgian and non-Belgian records. | Yes |
EU (non-Belgian) | Nederlands, Frans, Duits, Italiaans, Pools, Portugees, Roemeens, Spaans | Collapses several country labels into a broader region, reducing uniqueness from rare EU nationalities. | Retains a meaningful European mobility category without exposing exact country labels. | Yes |
Non-EU | Congolees, Marokkaans, Turks | Absorbs the highest-risk rare-country labels into one broad release category. | Preserves a coarse but policy-relevant distinction for descriptive analysis. | Yes |
Fallback: Non-Belgian | Applied when the 3-group cell is still below k | Further reduces small-cell risk in very small police datasets. | Provides a pragmatic fallback when the 3-group scheme remains too sparse. | Yes |
Table 2.12: Recoding and suppression counts in the final release
Dataset | Released in 3 groups (n) | % recoded to 3 groups | Fallback to 2 groups (n) | % fallback 2-group | Age group suppressed (n) | % age suppressed |
|---|---|---|---|---|---|---|
Offenders | 54 | 63.5 | 31 | 36.5 | 3 | 3.5 |
Victims | 64 | 58.2 | 46 | 41.8 | 1 | 0.9 |
Table 2.12 quantifies the practical cost of the final release logic. In small police datasets, grouped nationality categories often retain more analytical meaning than blanket suppression, but some records still need a fallback to Belgian versus non-Belgian and some age groups still need suppression when the corresponding age-by-gender cell remains below the threshold.
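The grouping-with-fallback logic can be sketched as below. The label sets follow Table 2.11, but the definition of a "cell" is an assumption: here it is gender crossed with the 3-group nationality category, and the pipeline may use a different combination. Function names and example records are hypothetical.

```python
from collections import Counter

# EU (non-Belgian) source labels as listed in Table 2.11
EU_NON_BELGIAN = {"Nederlands", "Frans", "Duits", "Italiaans",
                  "Pools", "Portugees", "Roemeens", "Spaans"}

def three_group(nationality):
    if nationality == "Belgisch":
        return "Belgian"
    # Simplification: any label not listed as EU is treated as Non-EU.
    return "EU (non-Belgian)" if nationality in EU_NON_BELGIAN else "Non-EU"

def two_group(nationality):
    return "Belgian" if nationality == "Belgisch" else "Non-Belgian"

def release_nationality(records, k=5):
    """records: list of (gender, nationality). Returns one released label per record."""
    cells = Counter((gender, three_group(nat)) for gender, nat in records)
    return [
        three_group(nat) if cells[(gender, three_group(nat))] >= k else two_group(nat)
        for gender, nat in records
    ]

# Hypothetical records: the Non-EU cell has only 2 members, below k = 5
records = [("M", "Belgisch")] * 6 + [("M", "Duits")] * 5 + [("M", "Congolees")] * 2
released = Counter(release_nationality(records))
print(released)
```

This sketch does not re-check whether the fallback cell itself reaches k; a production version would need that second pass before release.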
This is the complete output of the pipeline as a researcher receives it. Researchers have no access to real identifiers, only pseudonymous IDs and generalised attributes, yet they can perform full longitudinal and cross-dataset analysis.
Figure 2.2 shows the two concrete transformations the pipeline applies to offender records. The top row compares raw country-level nationality labels (Panel A) against the release categories (Panel B): most records are released under the 3-category system (Belgian / EU / Non-EU, shown in blue); records whose 3-category cell fell below k=5 are recoded to the 2-category fallback (Belgian / Non-Belgian, shown in orange). The bottom row restricts to records that had a valid age before suppression and shows which age-gender cells survive intact (Panel C) and which fall below the k=5 threshold and are withheld from the release (Panel D, grey bars). Figure 2.3 then confirms that neighbourhood-level cross-tabulation remains meaningful after the same protections are applied.
Figure 2.2: Nationality grouping and age suppression — before vs after pipeline (offender records)
Neighbourhood names are preserved as analytical attributes while all person and case identifiers are pseudonymized. The heatmap below shows whether spatial cross-tabulation still works after these protections.
Figure 2.3: Offence type by neighbourhood (post-suppression)
Taken together, Figures 2.2 and 2.3 show that the release remains suitable for standard descriptive and exploratory analysis across age, offence type, and neighbourhood, even though direct identifiers have been removed and some attributes have been generalised.
Sections 2.3–2.6 demonstrate that deterministic keyed pseudonymization solves the linkage problem and eliminates direct identifiers. However, two categories of risk remain after pseudonymization is complete and require separate treatment before a research extract can be safely released. They are not deficiencies of the pipeline: they are inherent to any structured data release and must be addressed at the release-preparation stage.
Exact timestamps can become identifying when combined with offence category, neighbourhood, or person-level attributes — even when all direct identifiers have been removed. This risk is structural: it arises from the precision of the data itself, not from any failure of the pseudonymization step.
| Precision level | Unique combinations | Total records | % unique | Re-id risk |
|---|---|---|---|---|
| Exact (date + HH:MM) | 119 | 120 | 99.2 | High |
| Date + hour of day (24) | 118 | 120 | 98.3 | High |
| Date + time band (4) | 117 | 120 | 97.5 | High |
| Week + time band | 108 | 120 | 90.0 | High |
| Month + time band | 87 | 120 | 72.5 | High |
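Table 2.13 should be interpreted as a trade-off table rather than a fixed rule. It shows how uniqueness changes as temporal precision is coarsened: in this 120-record example, the share of unique combinations falls from 99.2% at exact timestamps to 72.5% at month plus time band, and every level is still rated high risk. A data controller can use figures of this kind to justify releasing broader time bands instead of exact timestamps.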
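The uniqueness figures in such a table can be computed with a short routine. The sketch below assumes that "% unique" is the number of distinct (offence, neighbourhood, time key) combinations divided by the number of records, and that the four time bands are six-hour blocks; both are assumptions, as are the function names and the four synthetic records.

```python
from datetime import datetime

BANDS = ("night", "morning", "afternoon", "evening")  # assumed 6-hour bands


def time_band(hour: int) -> str:
    """Map an hour (0-23) to one of four 6-hour bands."""
    return BANDS[hour // 6]


def temporal_key(ts: datetime, level: str) -> str:
    """Coarsen a timestamp to the requested precision level."""
    if level == "exact":
        return ts.strftime("%Y-%m-%d %H:%M")
    if level == "date_hour":
        return ts.strftime("%Y-%m-%d %H")
    if level == "date_band":
        return f"{ts:%Y-%m-%d} {time_band(ts.hour)}"
    if level == "week_band":
        iso = ts.isocalendar()  # (ISO year, ISO week, weekday)
        return f"{iso[0]}-W{iso[1]:02d} {time_band(ts.hour)}"
    if level == "month_band":
        return f"{ts:%Y-%m} {time_band(ts.hour)}"
    raise ValueError(f"unknown precision level: {level}")


def distinct_share(records, level):
    """Return (distinct combinations, percent of records) at a precision level."""
    combos = {
        (r["offence"], r["neighbourhood"], temporal_key(r["timestamp"], level))
        for r in records
    }
    return len(combos), 100 * len(combos) / len(records)


# Four synthetic records sharing offence type and neighbourhood.
records = [
    {"offence": "theft", "neighbourhood": "Centrum", "timestamp": datetime(2022, 3, 15, 23, 45)},
    {"offence": "theft", "neighbourhood": "Centrum", "timestamp": datetime(2022, 3, 15, 23, 10)},
    {"offence": "theft", "neighbourhood": "Centrum", "timestamp": datetime(2022, 3, 17, 22, 0)},
    {"offence": "theft", "neighbourhood": "Centrum", "timestamp": datetime(2022, 3, 28, 20, 0)},
]

for level in ("exact", "date_hour", "date_band", "week_band", "month_band"):
    print(level, distinct_share(records, level))
```

Running the same function over each candidate precision level gives the controller one comparable number per level, which is the evidence a release decision would rest on.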
Structured pseudonymization is not enough when police records include narrative text. Names, RRNs, addresses, and times often remain embedded in natural language fields and therefore require a separate redaction step before a research extract can be released. Unlike temporal precision, which can be addressed by coarsening a date field, free-text risk requires entity recognition: the system must detect that a phrase is a name or an address before it can redact it. Table 2.14 illustrates this transformation on four synthetic Dutch-language police narratives.
| Record ID | Raw narrative | Redacted narrative |
|---|---|---|
| PV-A1B2 | Op 15/03/2022 om 23:45 werd Jan De Smedt, RRN 86.04.12-234.71, wonende Langestraat 42, 9000 Gent, aangetroffen. | Op [DATUM] om [TIJD] werd [NAAM], RRN [RRN], wonende [ADRES], [POSTCODE], aangetroffen. |
| PV-C3D4 | Het voertuig werd bestuurd door Mohamed El Amrani, geboortedatum 04/04/1990, RRN 90.04.04-123.45, uit Merelbeke. | Het voertuig werd bestuurd door [NAAM], geboortedatum [DATUM], RRN [RRN], uit Merelbeke. |
| PV-E5F6 | Slachtoffer Emma Vandenberghe (geb. 12-06-1978), verblijvend te Veldstraat 18, Gent, deed aangifte van diefstal. | [NAAM] (geb. [DATUM]), verblijvend te [ADRES], Gent, deed aangifte van diefstal. |
| PV-G7H8 | Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om 02:30. | Geen persoonsgegevens aanwezig. Voertuig geparkeerd nabij het station om [TIJD]. |
The regex example is only a baseline used to illustrate the transformation logic. In practice, entity detection in free text needs a stronger NLP layer because names, addresses, and times appear in many formats that simple patterns do not capture reliably. The table itself shows the limits: the place names in rows PV-C3D4 (Merelbeke) and PV-E5F6 (Gent) pass through unredacted.
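A regex baseline of this kind might look like the sketch below. It is an illustration under stated assumptions, not the report's own code: the patterns and the `redact` function are hypothetical, and they cover only the three fixed-format entities (RRN, date, time). Names and addresses are deliberately left untouched, since those are the cases that need entity recognition rather than patterns.

```python
import re

# Order matters: the RRN pattern runs first so its digit groups are
# already replaced before the date and time patterns are applied.
PATTERNS = [
    (re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b"), "[RRN]"),   # e.g. 90.04.04-123.45
    (re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b"), "[DATUM]"),           # e.g. 04/04/1990, 12-06-1978
    (re.compile(r"\b\d{2}:\d{2}\b"), "[TIJD]"),                        # e.g. 23:45
]


def redact(text: str) -> str:
    """Replace fixed-format identifiers with placeholders (names are NOT handled)."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


# Synthetic narrative from the table above.
raw = ("Het voertuig werd bestuurd door Mohamed El Amrani, "
       "geboortedatum 04/04/1990, RRN 90.04.04-123.45, uit Merelbeke.")
print(redact(raw))
```

On this input the date and RRN are replaced but the personal name and the municipality remain, which is exactly the gap an NLP layer would have to close.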
ALLOW_DEMO_KEY=true is set; in production the key must be supplied by the data controller from a secure vault or HSM.
While the General Data Protection Regulation (GDPR) establishes the overarching privacy-by-design principles reflected in this pipeline, the primary legislative authority for police data processing in Belgium is Directive (EU) 2016/680, the Law Enforcement Directive (LED), transposed into Belgian law by the Law of 30 July 2018. Data held for law enforcement purposes does not fall under the GDPR directly; the LED and the Wet op het Politieambt (WPA) together govern the lawfulness, retention, and research-access conditions that apply to Belgian police datasets. When the same data are disclosed to a non–law-enforcement controller (e.g., a university) for independent research, that downstream processing is normally subject to the GDPR.
Table 3.1 is a compliance map. It links each technical mechanism in the pipeline to the identifier or risk it addresses and to the corresponding legal basis under the GDPR, the LED, and the Wet op het Politieambt.
| Mechanism | Identifier addressed | Legal basis (GDPR / LED 2016/680 / WPA) |
|---|---|---|
| HMAC-SHA256 pseudonymization | PV number, Person ID, RRN | GDPR Art. 4(5); LED Art. 3(5) (pseudonymization; key held by controller) |
| Age group (not exact DOB) | Date of birth | GDPR Art. 5(1)(c); LED Art. 4(1)(c); WPA Art. 44/1 (data minimisation) |
| Grouped nationality categories | Nationality | GDPR Art. 9(1); LED Art. 10 (special categories in criminal justice context) |
| Address removed | Street, house number | GDPR Art. 5(1)(c); LED Art. 4(1)(c); WPA Art. 44/1 (minimum necessary data) |
| k-Anonymity suppression | QI combinations | GDPR Art. 89(1); LED Art. 4(3) (proportionate technical safeguards for research; scientific use subject to appropriate safeguards) |
| Universal person pseudonym | Cross-dataset identity | GDPR Rec. 26; LED Art. 3(5) (re-identification must not be reasonably possible for recipients; remains personal data for the controller) |
| Key held separately | De-pseudonymization | GDPR Art. 25; LED Art. 20; LED Art. 29; WPA Art. 44/1; WPA Art. 44/11, §§ 7–14 (data protection by design and security of processing) |
Key operational requirement for the UGent Crime Lab project. The consistent pseudonymization key is the technical core of the anonymization algorithm. The data controller (the local police zone) must generate it once per data release and store it in a hardware security module (HSM) or certified key vault; it must never be held in the research environment. The key must be rotated per researcher batch to prevent cross-batch re-identification, and its use must be audited in accordance with the GDPR Art. 30 processing register obligation. This key management approach reflects the data-protection-by-design and security duties in GDPR Art. 25, LED Arts. 20 and 29, and WPA Art. 44/1.
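How a controller-held key produces stable but batch-specific pseudonyms can be shown with Python's standard `hmac` module. This is a minimal sketch under assumptions: the function names, the `P` prefix, the 16-character truncation, and the demo keys are hypothetical and stand in for keys that would come from the controller's vault or HSM.

```python
import hashlib
import hmac
import re


def canonical_rrn(raw: str) -> str:
    """Strip punctuation so '86.04.12-234.71' and '86041223471' hash identically."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 11:
        raise ValueError("an RRN is expected to contain 11 digits")
    return digits


def pseudonymize(identifier: str, key: bytes, prefix: str) -> str:
    """Keyed, deterministic pseudonym: HMAC-SHA256 over 'prefix:identifier'.

    The prefix separates identifier types; truncating to 16 hex characters
    is an illustrative choice, not a recommendation.
    """
    message = f"{prefix}:{identifier}".encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:16]}"


# Demo keys only (assumption); production keys never enter the research environment.
key_batch_a = b"demo-key-batch-A"
key_batch_b = b"demo-key-batch-B"

same_1 = pseudonymize(canonical_rrn("86.04.12-234.71"), key_batch_a, "P")
same_2 = pseudonymize(canonical_rrn("86041223471"), key_batch_a, "P")
other = pseudonymize(canonical_rrn("86.04.12-234.71"), key_batch_b, "P")

print(same_1 == same_2)  # same key, same person: joins across files still work
print(same_1 == other)   # different batch key: pseudonyms cannot be matched
```

Canonicalising before hashing is what keeps joins intact when extracts format the same identifier differently, and rotating the key per batch is what prevents two researcher batches from being linked to each other.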
Wet op het Politieambt (WPA): Belgian Police Act. Belgian police data is additionally governed by the Wet op het Politieambt (WPA, Belgisch Staatsblad 22 December 1992, as amended). Art. 44/1 WPA requires that personal data collected in the performance of police duties be accurate, adequate, relevant and not excessive. Art. 44/3 WPA mandates retention periods proportionate to purpose. Art. 44/7 WPA grants data subjects rights of access and correction. The pseudonymization pipeline directly operationalises the WPA data minimisation requirement: only the minimum attributes needed for scientific analysis are transferred to the researcher; all surplus personal data is removed or transformed before the data leaves the police system.
LED 2016/680: EU Law Enforcement Directive. Police data held for law enforcement purposes falls under Directive (EU) 2016/680 (transposed in Belgium by the Law of 30 July 2018), not the GDPR directly. Key distinctions from the GDPR are as follows: lawfulness of processing derives from national law (LED Art. 8, not GDPR Art. 6); research access requires a documented scientific purpose with minimum-necessary data (LED Art. 4(3)); and special categories such as racial or ethnic origin and health data are subject to LED Art. 10. The compliance table above maps each mechanism to both its GDPR analogue and the corresponding LED and WPA article.
The report does not demonstrate every operational challenge of a full police-data release system, but it does show three constraints that remain central in practice. First, spatial masking does not affect every criminological analysis in the same way: in Part I the log-distance and boundary-free 1 km familiarity terms are more sensitive to point displacement than several ward-level opportunity covariates, so masking has to be validated against the target analysis rather than chosen abstractly. Second, linked files remain usable only when pseudonymization is deterministic, because joins fail if key use or identifier formatting varies across extracts. Third, even after direct identifiers are removed, linked researcher files can still contain rare quasi-identifier combinations, which is why suppression and coarsening remain necessary before release.
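The third constraint can be illustrated with a small k-anonymity check that blanks rare quasi-identifier combinations and records what it suppressed. This is a sketch under assumptions rather than the pipeline's implementation: the function name, the `pid`, `age_group`, and `gender` field names, and the policy of setting suppressed values to `None` are hypothetical.

```python
from collections import Counter


def suppress_small_cells(records, quasi_identifiers, k=5):
    """Blank quasi-identifiers in cells smaller than k and log each suppression.

    Returns (released_records, suppression_log). Input records are not modified.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    released, log = [], []
    for r in records:
        cell = tuple(r[q] for q in quasi_identifiers)
        out = dict(r)  # copy so the source extract stays untouched
        if counts[cell] < k:
            for q in quasi_identifiers:
                out[q] = None  # assumed policy: withhold the whole combination
            log.append({"pid": r["pid"], "cell": cell, "cell_size": counts[cell]})
        released.append(out)
    return released, log


# Synthetic example: five records share a cell, one record is unique.
records = [
    {"pid": f"P-{i:03d}", "age_group": "25-34", "gender": "M"} for i in range(5)
] + [{"pid": "P-005", "age_group": "65+", "gender": "F"}]

released, log = suppress_small_cells(records, ["age_group", "gender"], k=5)
print(released[-1])
print(log)
```

Keeping the log separate from the released file gives the controller an auditable record of every suppression without exposing the suppressed values to researchers.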
This report demonstrates the technical foundations of a privacy-preserving pipeline for Belgian police data. The three areas below describe not only the competencies brought to bear in this demonstration, but specifically how I plan to carry the project forward into an operational data release system for the UGent Crime Lab.
Spatial criminology, scale, and model sensitivity. The crime location choice framework applied in Part I will guide masking decisions in the operational pipeline: because spatial masking affects different covariates differently, each new release will need to be validated against the target analysis rather than evaluated in the abstract. I plan to extend this validation framework to additional offence types and to Belgian spatial units, building a masking-quality evidence base that is directly reusable across research projects within the Crime Lab.
Linked-data handling and pseudonymization logic. The pseudonymization pipeline built in Part II is designed for extension. The next steps are to integrate the canonical PV and RRN wrappers into the data controller's extract workflow, define key rotation procedures per researcher batch, and test the pipeline against realistic multi-zone Belgian police extracts. The goal is a documented, repeatable release procedure that any police zone data manager can follow without specialist programming knowledge.
Reproducible implementation and risk auditing. The risk audit framework (quasi-identifier analysis, k-anonymity checks, suppression logging) will be packaged as a standalone module (implemented in R or Python, depending on the operational environment) that the data controller can run before each release. I plan to document the module so that it satisfies the GDPR Art. 30 processing register obligation and can be presented to a Data Protection Officer as structured evidence of the technical safeguards in place.
Security note on key mapping table: A mapping between real identifiers (RRN, PV numbers) and their pseudonyms has been written to a secure folder accessible only to the data controller. This file is for police system use only. It must never be shared with researchers or stored in the research environment. In production, this table would be held in the data controller's HSM or certified key vault, satisfying GDPR Art. 32.
All names, RRNs, and case details in this section are fully synthetic. No real persons are represented.
Belgian Federal Government. (1992). Wet op het Politieambt [Belgian Police Act]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/
Belgian Federal Government. (2018). Wet van 30 juli 2018 betreffende de bescherming van natuurlijke personen met betrekking tot de verwerking van persoonsgegevens [Law of 30 July 2018 on the protection of natural persons with regard to the processing of personal data]. Belgisch Staatsblad. https://www.ejustice.just.fgov.be/
Bernasco, W., & Nieuwbeerta, P. (2005). How do residential burglars select target areas? A new approach to the analysis of criminal location choice. British Journal of Criminology, 45(3), 296-315. https://doi.org/10.1093/bjc/azh070
European Parliament and Council of the European Union. (2016a). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88. https://eur-lex.europa.eu/eli/reg/2016/679/oj
European Parliament and Council of the European Union. (2016b). Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data. Official Journal of the European Union, L 119, 89-131. https://eur-lex.europa.eu/eli/dir/2016/680/oj
European Union Agency for Cybersecurity. (2019). Pseudonymisation techniques and best practices: Recommendations on shaping technology according to data protection and privacy provisions. Publications Office of the European Union. https://www.enisa.europa.eu/publications/pseudonymisation-techniques-and-best-practices
Fotheringham, A. S., & Wong, D. W. S. (1991). The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A, 23(7), 1025-1044. https://doi.org/10.1068/a231025
Kuralarasan, K., & Bernasco, W. (2022). Location choice of snatching offenders in Chennai City. Journal of Quantitative Criminology, 38, 673-696. https://doi.org/10.1007/s10940-021-09514-9
Openshaw, S. (1984). The modifiable areal unit problem (Concepts and Techniques in Modern Geography No. 38). Geo Books.
Song, G., Bernasco, W., Liu, L., Xiao, L., Zhou, S., & Liao, W. (2019). Crime feeds on legal activities: Daily mobility flows help to explain thieves' target location choices. Journal of Quantitative Criminology, 35(4), 831-854. https://doi.org/10.1007/s10940-019-09406-z
Steenbeek, W., & Weisburd, D. (2016). Where the action is in crime? An examination of variability of crime across different spatial units in The Hague, 2001-2009. Journal of Quantitative Criminology, 32(3), 449-469. https://doi.org/10.1007/s10940-015-9276-3
Sweeney, L. (2000). Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. http://dataprivacylab.org/projects/identifiability/paper1.pdf
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570. https://doi.org/10.1142/S0218488502001648
Weisburd, D., Groff, E. R., & Yang, S.-M. (2012). The criminology of place: Street segments and our understanding of the crime problem. Oxford University Press.
© 2026 Dr. Kuralarasan Kumar. This document and its methodology are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). All synthetic data used in Part II are fully artificial; no real persons are represented.