We developed a novel method and protocol for on-the-fly de-identification of structured clinical/Epic/PHI data. This approach gives data governors complete administrative control over the risk of re-identification when sharing large cohort-based clinical data. At the extremes, null data or completely identifiable data may be shared, depending on access level, research needs, etc. For pilot studies, the data office can dial up security (naturally devaluing the data), whereas for promising pilot results, data governors may provide a more balanced dataset that trades off security against scientific value and impact. In a nutshell, our technique allows health systems to filter, export, package, and share virtually any clinical and medical data for a population cohort requested by a researcher interested in examining specific healthcare, biomedical, or translational characteristics of multivariate clinical data. Currently, there are no practical, scientifically reliable, and effective mechanisms to share real clinical data (containing no clearly identifiable PHI) without either 1) compromising the value of the data (by excessively scrambling/encoding it) or 2) introducing a substantial risk of re-identification of individuals by various stratification techniques.
The DataSifter protocol is based on an algorithm that involves (data-governor-controlled) iterative data manipulation: it stochastically identifies candidate entries across cases (subjects, participants) and features/variables (data elements) and subsequently selects, nullifies, and imputes that information. The process relies heavily on statistical multivariate imputation to preserve the joint distributions of the complex structured data archive. At each step, the algorithm generates a complete dataset that, in aggregate, closely resembles the intrinsic characteristics of the original cohort; on an individual level, however, the rows of data are substantially altered. This procedure drastically reduces the risk of subject re-identification by stratification, as the metadata for all subjects are repeatedly and lossily encoded. Mathematical modeling, statistical techniques, and probabilistic (re)sampling and imputation methods all play important parts in the proposed DataSifter protocol.
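To make one such nullify-and-impute pass concrete, here is a minimal sketch in R. The function name `sift_once`, the argument `frac_missing`, and the choice of the `mice` package for multivariate imputation are illustrative assumptions, not the released DataSifter implementation:

```r
# Sketch of a single DataSifter obfuscation pass (illustrative, not the
# released API). Assumes the 'mice' package for multivariate imputation.
library(mice)

sift_once <- function(data, frac_missing = 0.1) {
  n <- nrow(data); p <- ncol(data)
  # Stochastically select candidate entries and nullify them
  idx <- which(matrix(runif(n * p) < frac_missing, n, p), arr.ind = TRUE)
  for (k in seq_len(nrow(idx))) data[idx[k, 1], idx[k, 2]] <- NA
  # Impute the artificial missing values, which tends to preserve the
  # joint distributions while altering individual rows
  complete(mice(data, m = 1, printFlag = FALSE))
}
```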
The main applications of this technology include:
Sharing EMR Data: Allowing researchers and non-clinical investigators access to sensitive data (lossily encoded) to promote scientific modeling, rapid interrogation, exploratory and confirmatory analytics, and discovery of complex biomedical processes, health conditions, and biomedical phenotypes.
Sharing Biosocial Data: Allowing engineers and data scientists to examine sensitive CMS/Medicare/Census/HRS data without increasing the risk of participant re-identification.
Other government data (e.g., IRS) may similarly be shared, as DataSifter outputs, with greatly reduced privacy concerns.
The new method has several advantages:
This algorithm has computationally efficient implementations - a feature that is attractive to data governors who deal with large volumes of data inquiry requests.
On the individual person level, the DataSifter scrambles sensitive information that could be used for stratification-based data re-identification attacks (thus reducing security risks); however, it preserves the joint distribution of the entire cohort-based data (thus facilitating the urgent need to expedite data interrogation, derive actionable knowledge, and enable rapid decision support).
As the data governors can keep their mapping between the native subject identifiers (e.g., EMR, SSN) and the study-specific subject IDs (sequential or random), the size and complexity of the data collection may easily be extended at a later point in time (e.g., to add additional longitudinal data augmenting previously provided DataSifter data).
In the implementation of DataSifter, we used SOCR libraries (http://socr.umich.edu/html/SOCR_CitingLicense.html) and R software (https://www.r-project.org/Licenses/).
Ideally, we want five lists of feature types as input:
LIST1: List of features to remove for privacy or other important reasons (this may/should be done at the source)
LIST2: List of dates
LIST3: List of categorical features (i.e., binomial/multinomial)
LIST4: List of unstructured features (e.g., medicine taken, with a dose specification and type of release)
LIST5: List of numerical features (both integer and double, on which to calculate distances and entropy)
List1 should be supplied by the data governor, or these features should even be filtered out before any data is given out.
List2 of DATE features can be automated (as long as "DATE" appears in the feature name).
List3 of categorical features is now automated, enforcing the following criterion for defining a categorical variable (see the sketch below):
if (#unique values) / (3 * log(#cases)) > 1, then NOT CATEGORICAL.
List4 of unstructured data can be automated.
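A minimal sketch of how the List2 and List3 detections above might be automated in R (the helper name `detect_lists` is illustrative; the threshold implements the 3*log(#cases) rule stated above):

```r
# Sketch of automated feature-type detection for List2 and List3.
# A feature is treated as a date if "DATE" appears in its name, and as
# categorical if its number of unique values does not exceed 3*log(#cases).
detect_lists <- function(data) {
  list2_dates <- names(data)[grepl("DATE", names(data), ignore.case = TRUE)]
  n_cases <- nrow(data)
  is_categorical <- sapply(data, function(x) {
    length(unique(x)) / (3 * log(n_cases)) <= 1
  })
  list(dates = list2_dates, categorical = names(data)[is_categorical])
}
```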
This step can calculate a variety of dissimilarity or distance metrics between cases/subjects. The function distance() in the ecodist package is written for extensibility and understandability, and it may not be efficient for use with large matrices. Initially, we select only the numeric features (which include integers such as dates or total counts of medicines); later we may need to expand this to strings. The default distance is the Bray-Curtis distance (more stable, with fewer singularities). The result of distance() is a lower-triangular distance matrix, an object of class "dist" of size ([# of cases] * [# of cases - 1])/2.
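For example, restricting attention to the numeric features, the case distances could be computed along these lines (a sketch, where `data` stands for the cohort data frame):

```r
# Sketch: compute pairwise case distances over the numeric features only,
# using the Bray-Curtis metric from the ecodist package.
library(ecodist)

numeric_features <- data[, sapply(data, is.numeric), drop = FALSE]
d <- distance(as.matrix(numeric_features), method = "bray-curtis")
# 'd' is a lower-triangular "dist" object
length(d)  # equals nrow(data) * (nrow(data) - 1) / 2
```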
The first component of the DataSifter is k0; it determines the percentage of similar cases on which to swap unstructured feature values from List4. In this example we are using 5 options: [0, 0.1, 0.2, 0.3, 0.4].
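A minimal sketch of this swapping step, reusing the `dist` object `d` from above (the helper name `swap_unstructured` and the neighbor-selection details are illustrative assumptions, not the released implementation):

```r
# Sketch: for each case, swap unstructured (List4) values with one of its
# k0% nearest neighbors, using the precomputed distance object 'd'.
swap_unstructured <- function(data, list4, d, k0) {
  if (k0 == 0) return(data)
  dm <- as.matrix(d)               # full symmetric distance matrix
  n  <- nrow(data)
  m  <- max(1, floor(k0 * n))      # size of each case's neighborhood
  for (i in seq_len(n)) {
    neighbors <- order(dm[i, ])[2:(m + 1)]          # skip self at position 1
    j <- neighbors[sample.int(length(neighbors), 1)]
    # exchange the unstructured feature values between cases i and j
    tmp <- data[i, list4]
    data[i, list4] <- data[j, list4]
    data[j, list4] <- tmp
  }
  data
}
```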
The second and third components of the DataSifter are k1 and k2. They are linked, since the obfuscation step k1 is repeated as many times as specified by k2.
k1: the percentage of artificial missing values to introduce, followed by imputation. In this example we are using values from 0% up to 80% in increments of 5%.
k2: how many times to repeat k1. In this example we are using 5 options: [0, 1, 2, 3, 4].
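Putting k1 and k2 together, the obfuscation loop amounts to the following sketch, reusing the hypothetical `sift_once()` from above:

```r
# Sketch: repeat the nullify-and-impute step k2 times, each time
# introducing a fraction k1 of artificial missing values.
k1 <- 0.4   # fraction of artificial missing values per pass
k2 <- 2     # number of passes
for (rep in seq_len(k2)) {
  data <- sift_once(data, frac_missing = k1)
}
```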
The fourth component of the DataSifter is k3. It represents the fraction of features/cases to obfuscate by resampling their values from the empirical distributions. In this example we are using values from 0% up to 80% in increments of 5%. A sketch of this resampling step follows the example output below.
## [1] "k0-k3 combination chosen"
## [1] 0.0 0.4 2.0 0.2
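Given a chosen combination such as the one above (k0 = 0.0, k1 = 0.4, k2 = 2, k3 = 0.2), the k3 resampling step might look like the following sketch, which redraws a fraction of each feature's entries from its empirical distribution (`resample_fraction` is an illustrative name, not the released implementation):

```r
# Sketch: obfuscate a fraction k3 of the entries in each feature by
# resampling them (with replacement) from that feature's empirical distribution.
resample_fraction <- function(data, k3) {
  n <- nrow(data)
  m <- floor(k3 * n)               # number of entries to redraw per feature
  for (col in names(data)) {
    rows <- sample.int(n, m)       # which cases to obfuscate in this feature
    data[rows, col] <- sample(data[[col]], m, replace = TRUE)
  }
  data
}

data <- resample_fraction(data, k3 = 0.2)
```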