We developed a novel method and protocol for on-the-fly de-identification of structured Clinical/Epic/PHI data. This approach provides complete administrative control over the risk of data re-identification when sharing large clinical cohort-based medical data. At the extremes, null data may be shared or completely identifiable data may be provided (depending on access level, research needs, etc.). For pilot studies, the data office can dial up security (naturally devaluing the data), whereas for promising pilot results, data governors may provide a more balanced dataset trading security against scientific value/impact. In a nutshell, our technique allows Health Systems to filter, export, package, and share virtually any and all clinical and medical data for a population cohort requested by a researcher interested in examining specific healthcare, biomedical, or translational characteristics of multivariate clinical data. Currently, there are no practical, scientifically reliable, and effective mechanisms for sharing real clinical data (containing no clearly identifiable PHI) without either 1) compromising the value of the data (by excessive scrambling/encoding) or 2) introducing a substantial risk of re-identification of individuals by various stratification techniques.
The DataSifter protocol is based on an algorithm that involves (data-governor controlled) iterative data manipulation: it stochastically identifies candidate entries among the cases (subjects, participants) and features/variables (data elements), and subsequently selects, nullifies, and imputes the corresponding information. This process relies heavily on statistical multivariate imputation to preserve the joint distributions of the complex structured data archive. At each step, the algorithm generates a complete dataset that in aggregate closely resembles the intrinsic characteristics of the original cohort; however, on an individual level, the rows of data are substantially altered. This procedure drastically reduces the risk of subject re-identification by stratification, as the meta-data for all subjects are repeatedly and lossily encoded. Mathematical modeling, statistical techniques, and probabilistic (re)sampling and imputation methods all play important parts in the proposed DataSifter protocol.
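To make the nullify-and-impute step concrete, here is a minimal sketch of a single pass in R, assuming the mice package for multivariate imputation; the helper name sift_once and its p_missing argument are illustrative, not part of the DataSifter implementation.

library(mice)

# One nullify-and-impute pass (illustrative sketch): blank out a random
# fraction of the cells, then re-impute them from the joint multivariate
# structure via predictive mean matching.
sift_once <- function(df, p_missing = 0.05) {
  mat <- as.matrix(df)
  holes <- sample(length(mat), size = round(p_missing * length(mat)))
  mat[holes] <- NA
  imp <- mice(as.data.frame(mat), m = 1, method = "pmm", printFlag = FALSE)
  complete(imp, 1)
}

sifted <- sift_once(airquality[complete.cases(airquality), ], p_missing = 0.10)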
The main applications of this technology include:
Sharing EMR Data: Allowing researchers and non-clinical investigators access to lossily encoded sensitive data to promote scientific modeling, rapid interrogation, exploratory and confirmatory analytics, and discovery of complex biomedical processes, health conditions, and biomedical phenotypes.
Sharing Biosocial Data: Allowing engineers and data scientists to examine sensitive CMS/Medicare/Census/HRS data without increasing the risk of participant re-identification.
Sharing Other Government Data: Other government data (e.g., IRS records) may similarly be shared, as DataSifter outputs, without privacy concerns.
The new method has several advantages:
This algorithm has computationally efficient implementations, a feature that is attractive for data governors who deal with large volumes of data inquiry requests.
On the individual person level, the DataSifter scrambles sensitive information that could be used for stratification-based data re-identification attacks (thus reducing security risks); however, it preserves the joint distribution of the entire cohort-based data (thus facilitating the urgent need to expedite data interrogation, derive actionable knowledge, and enable rapid decision support).
As the data governors can keep their mapping between the native subject identifiers (e.g., EMR, SSN) and the study-specific subject IDs (sequential or random), the size and complexity of the data collection may easily be extended at a later point in time (e.g., to add longitudinal data augmenting previously provided DataSifter data).
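For instance, a data governor might maintain such a mapping as a simple two-column lookup table; the identifiers and column names below are made up for illustration.

# Illustrative governor-side lookup table linking native identifiers to
# study-specific IDs; this table never leaves the data office.
id_map <- data.frame(
  native_id = c("EMR-00481", "EMR-01932", "EMR-07755"),
  study_id  = sprintf("S%04d", sample(9999, 3))
)

# Newly arrived longitudinal data can be re-linked via the retained map.
new_visit <- data.frame(native_id = "EMR-01932", weight_kg = 81.4)
merge(new_visit, id_map, by = "native_id")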
In the implementation of DataSifter, we used SOCR libraries (http://socr.umich.edu/html/SOCR_CitingLicense.html) and R software (https://www.r-project.org/Licenses/).
Ideally, we want five lists of feature types as input:
LIST1: List of features to remove for privacy or other reasons.
LIST2: List of dates.
LIST3: List of categorical features (i.e., binomial/multinomial).
LIST4: List of unstructured features (e.g., medications taken, with dose specification and release type).
LIST5: List of features with known dependencies (e.g., bmi = f(weight, height)), or temporal correlation (e.g., weight/height of a kid over time).
List1 is provided by the data governor. After the features in List1 are removed, constant features are also removed if present in the dataset.
List2 of date features is provided by the data governor. The obfuscation of date features complies with the requested time resolution \(\Delta{t}\) (e.g., years, months, weeks, days, hours, minutes, …). Currently we obfuscate with \(\Delta{t}=year\).
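A minimal sketch of year-resolution date obfuscation: the helper obfuscate_dates() below is illustrative (not the DataSifter code); it keeps the year and randomizes the month and day within it.

# Coarsen dates to year resolution: keep the year, randomize month/day
# within that year so exact admission/visit dates cannot be recovered.
obfuscate_dates <- function(dates, delta_t = "year") {
  stopifnot(delta_t == "year")  # only Delta-t = year sketched here
  yr <- as.integer(format(as.Date(dates), "%Y"))
  as.Date(sprintf("%d-%02d-%02d", yr,
                  sample(1:12, length(yr), replace = TRUE),
                  sample(1:28, length(yr), replace = TRUE)))
}

obfuscate_dates(c("2014-03-17", "2014-11-02", "2015-06-30"))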
List3 of categorical features is automated, enforcing the following criterion for defining a categorical feature \(x\) (here \(N\) represents the total number of cases):
if \(|unique(x)| > \lfloor 3\log(N) \rfloor\), then \(x\) is NOT categorical.
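In R, this criterion reduces to a one-line predicate (a sketch; the name is_categorical is illustrative, and the bracket is interpreted as the floor function).

# A feature is treated as categorical when its number of distinct
# values does not exceed floor(3*log(N)), N being the number of cases.
is_categorical <- function(x, N = length(x)) {
  length(unique(x)) <= floor(3 * log(N))
}

N <- 1000
is_categorical(sample(c("M", "F"), N, replace = TRUE), N)  # TRUE
is_categorical(rnorm(N), N)                                # FALSE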
List4 and List5 are provided by the data governor.
This step can calculate a variety of dissimilarity or distance metrics between cases/subjects. The function distance() in the ecodist package is written for extensibility and understandability, and it may not be efficient for large matrices. Initially, we only select numeric features; later we may expand this to include factors/categorical features and strings/unstructured features. The default distance is the Bray-Curtis distance (more stable, fewer singularities). The result of distance() is a lower-triangular distance matrix, returned as an object of class "dist". With \(N\) representing the total number of cases, the length of the distance vector is \(N(N-1)/2\).
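A minimal sketch of this step on the numeric columns of a built-in dataset (the data is illustrative):

library(ecodist)

# Keep numeric features only; factors and unstructured text are
# excluded from the distance computation for now.
num_df <- Filter(is.numeric, airquality[complete.cases(airquality), ])

# Bray-Curtis is the default DataSifter choice (stable, few singularities).
d <- distance(as.matrix(num_df), method = "bray-curtis")
class(d)                                             # "dist"
length(d) == nrow(num_df) * (nrow(num_df) - 1) / 2   # TRUE: N(N-1)/2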
The first component of the DataSifter is k0.
k0: binary option [0,1] controlling whether or not to swap/obfuscate the unstructured features defined in List4. It is not relevant for computing distances between cases.
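One plausible reading of the k0 = 1 case, sketched for a single unstructured column (the helper swap_unstructured is illustrative, and "swap" is interpreted here as a random permutation across cases):

# When k0 = 1, permute the unstructured feature across all cases so no
# record retains its own free-text entry; k0 = 0 leaves it intact.
swap_unstructured <- function(x, k0) {
  if (k0 == 1) sample(x) else x
}

meds <- c("aspirin 81mg ER", "metformin 500mg", "lisinopril 10mg")
swap_unstructured(meds, k0 = 1)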
The second and third components of the DataSifter are k1 and k2. They are linked, since the obfuscation step k1 is repeated as many times as specified by k2.
k1: percentage of artificial missing values to introduce, followed by imputation. In this example we use values from 0% up to 40% in increments of 5%.
k2: how many times to repeat k1. In this example we use 5 options: [0,1,2,3,4].
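Combined, k1 and k2 amount to repeating the nullify-and-impute pass sketched earlier; the helper below reuses the illustrative sift_once function from that sketch.

# Repeat the k1-level nullify-and-impute pass k2 times, reusing the
# sift_once helper sketched above; k2 = 0 returns the data unchanged.
sift_k1_k2 <- function(df, k1, k2) {
  for (i in seq_len(k2)) df <- sift_once(df, p_missing = k1)
  df
}

sifted <- sift_k1_k2(airquality[complete.cases(airquality), ], k1 = 0.3, k2 = 1)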
The fourth component of the DataSifter is k3.
k3: fraction of features (among all features except the unstructured ones) to swap/obfuscate on ALL cases. For each case, the swapping partner is chosen within a certain radius (distance; see the fifth component, k4). In this example we use values from 0% up to 100% in increments of 5%.
The fifth component of the DataSifter is k4.
k4: the swapping step k3 is performed on each case by sampling among the k4-percentile of its neighbors (the fraction of closest neighbors from which to select the swapping partner). In this example we use values from 1% up to 100% in increments of 1%. We are not yet computing the contribution of k4 to the DataSifter slider \(\eta\). A sketch of this swapping step follows.
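The sketch below applies the k3/k4 idea to a single numeric feature (k3 would determine how many features receive this treatment); the helper name is illustrative, and base-R dist() is used in place of the ecodist distance for brevity.

# For each case, replace the value of one feature with the value held by
# a case sampled from its k4-fraction of nearest neighbors.
swap_with_neighbors <- function(x, dist_mat, k4) {
  n <- length(x)
  out <- x
  for (i in seq_len(n)) {
    nb <- order(dist_mat[i, ])[-1]                  # neighbors, nearest first
    pool <- nb[seq_len(max(1, ceiling(k4 * (n - 1))))]
    out[i] <- x[pool[sample.int(length(pool), 1)]]  # take a neighbor's value
  }
  out
}

df <- airquality[complete.cases(airquality), ]
dmat <- as.matrix(dist(scale(df)))                  # full symmetric matrix
df$Temp <- swap_with_neighbors(df$Temp, dmat, k4 = 0.3)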
The selection of the values for k0-k4 is done in Section 1.9 of this Rmd script, below.
## Pick a value from each of the K_options below and type it into the k_raw
k0_options <- c(0,1)
k1_options <- c(0,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40)
k2_options <- c(0,1,2,3,4)
k3_options <- c(0,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1)
k4_options <- seq(0.01, 1, 0.01)

# These are the values selected for this example
k0 <- 0 # swapping of unstructured data features
k1 <- 0.3 # % of data removal from the entire dataset (except unstructured features)
k2 <- 1 # how many times to perform imputation on the "disrupted" dataset after k1
k3 <- 0.8 # swapping of structured features (right now done on ALL features except the unstructured ones)
k4 <- 0.3 # % of closest neighbors to choose from

## [1] "Current k0-k4 options selected for the Data Sifter Slider"
## [1] 0.0 0.3 1.0 0.8 0.3
## [1] "Current value of the Data Sifter Slider"
## [1] 0.74932
## [1] "Distance Sifted Data from Raw Data"
## [1] 39186.53
We compare below 2 records across 4 features. This step selects only double (numeric) features for the record and beanplot comparisons: it samples 4 numeric features from the selected features and generates a table of Raw vs. Sifted records, as well as beanplots comparing the distributions (a sketch of the beanplot comparison follows the table below). The goal of the DataSifter is to obfuscate individual records without disrupting the overall distribution of the features. The magnitude of obfuscation is set by the five options selected to generate the normalized sifter slider value \(\eta\) between 0 and 1. Since we swap feature values between cases within a certain radius (set by k4), it is sometimes possible to swap the same values back and forth between the same cases. We will eventually place a control on this occurrence, if necessary.
## hglucose postop_temp_c edvisit_first_icd9 val_age
## Case 1 Raw 111.00 0.000 194.00 73.00
## Case 1 Sifted 145.96 0.000 12.83 63.34
## Case 2 Raw 179.00 36.300 1.00 89.00
## Case 2 Sifted 206.00 0.735 1.00 50.00
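The beanplot comparison can be reproduced along the following lines, assuming the beanplot package; the data here is synthetic, standing in for an actual raw/sifted feature pair.

library(beanplot)

# Side-by-side beanplots (illustrative data): if the sifting preserves
# the joint structure, the raw and sifted distributions should overlap.
raw    <- rnorm(200, mean = 120, sd = 25)    # stand-in for a raw feature
sifted <- sample(raw) + rnorm(200, sd = 5)   # stand-in for its sifted version
beanplot(raw, sifted, names = c("Raw", "Sifted"),
         main = "Raw vs. Sifted feature distribution")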