We developed a novel method and protocol for on-the-fly de-identification of structured clinical/Epic/PHI data. This approach provides complete administrative control over the risk of re-identification when sharing large, cohort-based clinical datasets. At the extremes, either null data or completely identifiable data may be shared (depending on access level, research needs, etc.). For pilot studies, the data office can dial up security (which naturally devalues the data), whereas for promising pilot results, data governors may provide a more balanced dataset that trades off security against scientific value/impact. In a nutshell, our technique allows health systems to filter, export, package, and share virtually any clinical and medical data for a population cohort requested by a researcher interested in examining specific healthcare, biomedical, or translational characteristics of multivariate clinical data. Currently, there are no practical, scientifically reliable, and effective mechanisms to share real clinical data (containing no clearly identifiable PHI) without either 1) compromising the value of the data (by excessive scrambling/encoding) or 2) introducing a substantial risk of re-identification of individuals through various stratification techniques.
The DataSifter protocol is based on an algorithm that involves iterative data manipulation, controlled by the data governor. The algorithm stochastically identifies candidate entries from the cases (subjects, participants) and features/variables (data elements), and subsequently selects, nullifies, and imputes the information. This process relies heavily on statistical multivariate imputation to preserve the joint distributions of the complex structured data archive. At each step, the algorithm generates a complete dataset that, in aggregate, closely resembles the intrinsic characteristics of the original cohort; on an individual level, however, the rows of data are substantially altered. This procedure drastically reduces the risk of subject re-identification by stratification, as the meta-data for all subjects are repeatedly and lossily encoded. Mathematical modeling, statistical techniques, probabilistic (re)sampling, and imputation methods all play an important part in the proposed DataSifter protocol.
The main applications of this technology include:
Sharing EMR Data: Allowing researchers and non-clinical investigators access to lossily encoded sensitive data to promote scientific modeling, rapid interrogation, exploratory and confirmatory analytics, and discovery of complex biomedical processes, health conditions, and biomedical phenotypes.
Sharing Biosocial Data: Allowing engineers and data scientists to examine sensitive CMS/Medicare/Census/HRS data without increasing the risk of participant re-identification.
Other government data (e.g., IRS) may similarly be shared, as DataSifter outputs, without privacy concerns.
The new method has some advantages:
This algorithm has computationally efficient implementations, a feature that is attractive for data governors who deal with large volumes of data requests.
On the individual level, the DataSifter scrambles sensitive information that could be used for stratification-based re-identification attacks (thus reducing security risks). At the same time, it preserves the joint distribution of the entire cohort-based data (thus addressing the urgent need to expedite data interrogation, derive actionable knowledge, and enable rapid decision support).
As the data governors can keep their mapping between the native subject identifiers (e.g., EMR, SSN) and the study-specific subject IDs (sequential or random), the size and complexity of the data collection may easily be extended at a later point in time (e.g., to add additional longitudinal data augmenting previously provided DataSifter data).
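The identifier mapping described above can be illustrated with a minimal Python sketch. This is not the DataSifter implementation (which uses R); the function name `assign_study_ids`, the `MRN-...` identifiers, and the use of random hexadecimal study IDs are all hypothetical choices made for illustration. The sketch assumes the data governor stores the mapping privately and reuses it when a later release extends the cohort.

```python
import secrets


def assign_study_ids(native_ids, mapping=None):
    """Return a native-ID -> study-ID mapping (hypothetical helper).

    Subjects already present in `mapping` keep their study ID, so data
    released later can be linked to an earlier release by the recipient.
    """
    mapping = dict(mapping or {})
    used = set(mapping.values())
    for native_id in native_ids:
        if native_id in mapping:
            continue  # previously released subject: keep the same study ID
        study_id = secrets.token_hex(4)  # random 8-character hex ID
        while study_id in used:          # extremely unlikely; regenerate on collision
            study_id = secrets.token_hex(4)
        mapping[native_id] = study_id
        used.add(study_id)
    return mapping


# First release: two subjects (the native IDs are made up).
first_release = assign_study_ids(["MRN-001", "MRN-002"])

# Later longitudinal extension: one returning subject, one new subject.
second_release = assign_study_ids(["MRN-002", "MRN-003"], mapping=first_release)
```

Only the study IDs would leave the data office; the mapping itself stays with the data governor.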
In the implementation of DataSifter, we used SOCR libraries (http://socr.umich.edu/html/SOCR_CitingLicense.html) and R software (https://www.r-project.org/Licenses/).
Ideally, we want four lists of feature types as input:
LIST1: List of features to remove for privacy or other reasons (this is provided by the data governor)
LIST2: List of dates
LIST3: List of categorical features (i.e., binomial/multinomial)
LIST4: List of unstructured features (e.g., medicine taken with a dose spec and type of release)
List1 should be supplied by the data governor, or these features may even be filtered out before any data is given out. After we remove the features in List1, we also check whether constant features are present in the dataset. If so, they are removed as well.
List2 of DATE features is automated (as long as DATE or a date-associated label is present in the header of the feature). DATE features are obfuscated to comply with the requested time resolution. Currently, we obfuscate by keeping only the year.
List3 of categorical features is now automated, enforcing the following criterion for defining a categorical variable:
if (# of unique values)/[3*log(# of cases)] > 1, then NOT CATEGORICAL.
List4 of unstructured data is provided by the data governor but can be automated.
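The feature-typing rules above can be sketched as follows. This is a minimal Python illustration, not the R code used in DataSifter. It rests on several assumptions that the text does not state: the table is a dictionary of columns, `log` is the natural logarithm, a date column is recognized by the substring "date" in its header, and dates are ISO-formatted strings (`YYYY-MM-DD`) so that the year is the first four characters. All function and column names are hypothetical.

```python
import math


def is_constant(values):
    """A feature with at most one distinct value carries no information."""
    return len(set(values)) <= 1


def is_categorical(values):
    """Rule from the text: NOT categorical if unique / (3 * log(cases)) > 1.

    Written as a product to avoid dividing by log(1) = 0.
    Assumption: natural logarithm.
    """
    return len(set(values)) <= 3 * math.log(len(values))


def type_features(table, remove_list):
    """Apply the List1-List3 rules to a dict of column name -> list of values."""
    # List1: governor-supplied removals, followed by constant features.
    kept = {k: list(v) for k, v in table.items() if k not in remove_list}
    kept = {k: v for k, v in kept.items() if not is_constant(v)}
    # List2: detect date columns by header and keep only the year.
    dates = [k for k in kept if "date" in k.lower()]
    for k in dates:
        kept[k] = [str(v)[:4] for v in kept[k]]  # assumes "YYYY-MM-DD" strings
    # List3: automated categorical detection on the remaining columns.
    categorical = [k for k in kept if k not in dates and is_categorical(kept[k])]
    return kept, dates, categorical


n = 20  # 3 * ln(20) is about 8.99, so up to 8 unique values count as categorical
table = {
    "ssn": [f"000-00-{i:04d}" for i in range(n)],             # on List1
    "site": ["A"] * n,                                        # constant
    "admit_date": [f"2015-01-{i + 1:02d}" for i in range(n)],
    "sex": ["F", "M"] * (n // 2),
    "bmi": [20 + 0.5 * i for i in range(n)],
}
kept, dates, categorical = type_features(table, remove_list={"ssn"})
```

In this toy table, `ssn` and `site` are dropped, `admit_date` is reduced to the year, `sex` is flagged as categorical, and `bmi` (20 unique values) is not.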
This step can calculate a variety of dissimilarity or distance metrics between cases/subjects. The function distance() in the ecodist package is written for extensibility and understandability, and it may not be efficient for use with large matrices. Initially, we select only numeric features. Later, we may need to expand this to include factors/categorical features and strings/unstructured features. The default distance is the Bray-Curtis distance (more stable, with fewer singularities). The result of distance() is a lower-triangular distance matrix, returned as an object of class “dist” with ([# of cases]*[# of cases - 1])/2 entries.
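To make the size of the distance object concrete, here is a small Python sketch of a pairwise Bray-Curtis computation. It is not the ecodist code; it uses the standard Bray-Curtis definition, sum(|x - y|) / sum(x + y), which assumes non-negative values, and it returns the lower triangle column by column, which is how R lays out a "dist" object. The function names are hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative numeric vectors."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


def pairwise_lower_triangle(rows):
    """Return the lower triangle of the distance matrix as a flat list.

    Entries are ordered d(2,1), d(3,1), ..., d(n,1), d(3,2), ...,
    giving n * (n - 1) / 2 values for n cases.
    """
    n = len(rows)
    return [
        bray_curtis(rows[i], rows[j])
        for j in range(n)
        for i in range(j + 1, n)
    ]


cases = [[1, 2], [3, 4], [5, 6]]            # three cases, two numeric features
distances = pairwise_lower_triangle(cases)  # 3 * 2 / 2 = 3 entries
```

The number of entries grows quadratically with the number of cases, which is why the efficiency caveat matters for large cohorts.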
The first component of the DataSifter is k0, which determines whether to obfuscate the unstructured features defined in List4. k0 is a binary option: [0,1].
The second and third components of the DataSifter are k1 and k2. They are linked, since the obfuscation step k1 is repeated as many times as specified by k2.
k1: the percentage of artificial missing values to introduce, followed by imputation. In this example, we use values from 0% to 40% in increments of 5%.
k2: how many times to repeat k1. In this example, we use 5 options: [0,1,2,3,4].
The fourth component of the DataSifter is k3. It represents the fraction of features/cases to obfuscate across ALL the cases. In this example, we use values from 0% to 100% in increments of 5%.
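The interplay of k1 and k2 can be sketched in Python as follows. This is an illustration only, under loudly stated assumptions: DataSifter relies on statistical multivariate imputation, whereas this sketch substitutes simple column-mean imputation to stay short, so it does not preserve joint distributions. The function names and the choice to nullify cells uniformly at random are hypothetical.

```python
import random

K1_OPTIONS = [i * 5 / 100 for i in range(9)]  # 0%, 5%, ..., 40%
K2_OPTIONS = [0, 1, 2, 3, 4]                  # number of repetitions of k1


def mean_impute(column):
    """Replace None with the mean of the observed values (stand-in imputer)."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to learn from; leave the column as is
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]


def sift_k1_k2(rows, k1, k2, seed=0):
    """Repeat k2 times: nullify a fraction k1 of all cells, then impute."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]  # work on a copy of the numeric table
    n_rows, n_cols = len(data), len(data[0])
    n_missing = round(k1 * n_rows * n_cols)
    for _ in range(k2):
        # Step k1a: introduce artificial missing values at random cells.
        for cell in rng.sample(range(n_rows * n_cols), n_missing):
            data[cell // n_cols][cell % n_cols] = None
        # Step k1b: impute each column (column mean here, by assumption).
        for j in range(n_cols):
            imputed = mean_impute([data[i][j] for i in range(n_rows)])
            for i in range(n_rows):
                data[i][j] = imputed[i]
    return data


raw = [[float(i), float(2 * i), float(i * i)] for i in range(10)]  # toy 10 x 3 table
sifted = sift_k1_k2(raw, k1=0.2, k2=2)
unchanged = sift_k1_k2(raw, k1=0.2, k2=0)  # k2 = 0 skips the k1 step entirely
```

With k2 = 0 (or k1 = 0) the table is returned unaltered, which corresponds to the case where no imputation is necessary and steps k1 and k2 are skipped.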
## [1] "Current k0-k3 options selected for the Data Sifter Slider"
## [1] 0.2 0.0 1.0 0.5
## [1] "Current value of the Data Sifter Slider"
## [1] 0.4487761
## [1] "No imputation necessary, steps k1 and k2 skipped"
## [1] "Distance Sifted Data from Raw Data"
## [1] 4050291
##                                  [,1]        [,2]
## bmi                      2.646673e+01  1.09658973
## weight_lb               -5.100500e-01 -0.19884334
## glucose1                -4.767445e-02 -0.05191434
## hcreatinine              5.416449e-02 -0.04780429
## lhematocrit              1.290495e-01  0.10656088
## bloodglucose             7.987260e-02  0.03506021
## creatinine               1.069362e+00  0.94596397
## hct                     -2.419010e-02  0.03804396
## inr                      5.382232e+00  0.91458484
## platecount              -7.155486e-02 -0.06284009
## wbc                      2.103844e-01  0.18599646
## intraop_temp_celsius    -2.500595e+00 -0.44385678
## intraop_temperature     -1.663462e+00 -0.22103861
## intraop_temp_fahrenheit            NA -0.27568645
## postop_temp_celsius      4.947372e+00  7.31584732
## postop_temperature       2.413219e+00 -3.32989841
## postop_temp_fahrenheit             NA  1.68577459
## val_bmi                 -2.379922e+01  0.41415898
## val_nonsurgtime         -3.881009e-02 -0.03425127
## val_surgtime            -9.713865e-04  0.01759706
## postop_temp_c                      NA          NA
## postop_temp_f                      NA  2.77227772
##      postop_temperature hcreatinine platecount glucose1
## 604                97.2        1.00        228      125
## 193                97.7        0.90        126      163
## 6041               36.5        0.99        214      165
## 1931               36.5        0.75        257      131