Background

We developed a novel method and protocol for on-the-fly de-identification of structured Clinical/Epic/PHI data. This approach provides complete administrative control over the risk of data identification when sharing large, cohort-based clinical data. At the extremes, null data may be shared, or completely identifiable data can be provided (depending on access level, research needs, etc.). For pilot studies, the data office can dial up security (naturally devaluing the data), whereas for promising pilot results, data governors may provide a more balanced dataset that trades off security against scientific value/impact. In a nutshell, our technique allows Health Systems to filter, export, package and share virtually any and all clinical and medical data for a population cohort requested by a researcher interested in examining specific healthcare, biomedical, or translational characteristics of multivariate clinical data. Currently, there are no practical, scientifically reliable, and effective mechanisms to share real clinical data (containing no clearly identifiable PHI) without either 1) compromising the value of the data (by excessively scrambling/encoding) or 2) introducing a substantial risk of re-identification of individuals by various stratification techniques.

Technology

The DataSifter protocol is based on an algorithm that involves (data-governor controlled) iterative data manipulation. The algorithm stochastically identifies candidate entries from the cases (subjects, participants) and features/variables (data elements), and subsequently selects, nullifies, and imputes the information. This process relies heavily on statistical multivariate imputation to preserve the joint distributions of the complex structured data archive. At each step, the algorithm generates a complete dataset that in aggregate closely resembles the intrinsic characteristics of the original cohort; on an individual level, however, the rows of data are substantially altered. This procedure drastically reduces the risk of subject re-identification by stratification, as meta-data for all subjects is repeatedly and lossily encoded. Mathematical modeling, statistical techniques, probabilistic (re)sampling and imputation methods play an important part in the proposed DataSifter protocol.

Applications

The main applications of this technology include:

Sharing EMR Data: Allowing researchers and non-clinical investigators access to lossily encoded sensitive data to promote scientific modeling, rapid interrogation, exploratory and confirmatory analytics, and discovery of complex biomedical processes, health conditions, and biomedical phenotypes.
Sharing Biosocial Data: Allowing engineers and data scientists to examine sensitive CMS/Medicare/Census/HRS data without increasing the risk of participant re-identification.
Other government data (e.g., IRS) may similarly be shared, as DataSifter outputs, without privacy concerns.

Advantages

The new method has the following advantages:

This algorithm has computationally efficient implementations, a feature that is attractive for data governors who deal with large volumes of data requests.
On the individual person level, the DataSifter scrambles sensitive information that can be used for stratification-based data re-identification attacks (thus reducing security risks). At the same time, it preserves the joint distribution of the entire cohort-based data (thus addressing the urgent need to expedite data interrogation, derive actionable knowledge, and enable rapid decision support).
As the data governors can keep their mapping between the native subject identifiers (e.g., EMR, SSN) and the study-specific subject IDs (sequential or random), the size and complexity of the data collection may easily be extended at a later point in time (e.g., to add longitudinal data augmenting previously provided DataSifter data).

Software/materials

In the implementation of DataSifter, we used SOCR libraries (http://socr.umich.edu/html/SOCR_CitingLicense.html) and R software (https://www.r-project.org/Licenses/).

Notes

In terms of implementation, repeated user requests for the same cohort data (e.g., different values) should be discouraged and viewed with suspicion, as multiple copies of different iterations of DataSifter-generated data of the same cohort may be cleverly merged in an effort to reconstruct the original (protected) data.
The www.DataSifter.org domain has been acquired and is available (from the SOCR Group) to advertise and share the technology.
Handling unstructured data, strings or non-ordinal data: As an initial proof-of-concept approach, we will randomly swap such features between cases/participants, restricting swaps to close pairs of cases identified using appropriate distance metrics defined on the structured quantitative features in the data. Later, we will use NLP and ML methods for transforming the unstructured data into structured data elements that can be jointly utilized in the DataSifting process.
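The proof-of-concept swap described in the last note can be illustrated with a short Python sketch. This is not the authors' R implementation: the function name swap_unstructured, the column names, the greedy pairing strategy, and the use of Euclidean distance are all assumptions made for illustration (the note only calls for "appropriate distance metrics").

```python
import random

def swap_unstructured(rows, numeric_keys, text_key, seed=0):
    """Swap `text_key` between each case and its closest still-unpaired case.

    Hypothetical helper: closeness is Euclidean distance over `numeric_keys`,
    and pairs are formed greedily in random order (both are assumptions).
    """
    rng = random.Random(seed)

    def dist(a, b):
        return sum((a[k] - b[k]) ** 2 for k in numeric_keys) ** 0.5

    out = [dict(r) for r in rows]          # copy so the input stays untouched
    unpaired = list(range(len(rows)))
    rng.shuffle(unpaired)
    while len(unpaired) >= 2:
        i = unpaired.pop()
        # closest remaining case, measured on the structured numeric features
        j = min(unpaired, key=lambda c: dist(rows[i], rows[c]))
        unpaired.remove(j)
        out[i][text_key], out[j][text_key] = rows[j][text_key], rows[i][text_key]
    return out

cases = [
    {"bmi": 20.0, "wbc": 5.0, "note": "a"},
    {"bmi": 20.1, "wbc": 5.0, "note": "b"},
    {"bmi": 35.0, "wbc": 9.0, "note": "c"},
    {"bmi": 35.2, "wbc": 9.0, "note": "d"},
]
swapped = swap_unstructured(cases, ["bmi", "wbc"], "note")
```

With an odd number of cases, one case keeps its original value in this sketch; a production version would need a policy for that leftover case.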

List of features to be handled

Ideally, we want to know four lists of feature types as input:

LIST1: List of features to remove for privacy or other reasons (this is provided by the data governor)
LIST2: List of dates
LIST3: List of categorical features (i.e., binomial/multinomial)
LIST4: List of unstructured features (e.g., medicine taken with a dose spec and type of release)

List1 should be supplied by the data governor, or the listed features may even be filtered out before any data is given out. After we remove the features in List1, we also check whether constant features are present in the dataset. If so, they are removed as well.
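A minimal Python sketch of this filtering step follows. The function name drop_features and the representation of the dataset as a list of dictionaries are assumptions for illustration, not part of the DataSifter code.

```python
def drop_features(rows, list1):
    """Remove governor-listed features, then any feature that is constant.

    Hypothetical helper: `rows` is a list of dicts sharing the same keys.
    """
    excluded = set(list1)
    keep = [k for k in rows[0] if k not in excluded]
    # a feature is constant when every case carries the same value
    keep = [k for k in keep if len({r[k] for r in rows}) > 1]
    return [{k: r[k] for k in keep} for r in rows]

cohort = [
    {"mrn": "A1", "site": "UM", "bmi": 22.0},
    {"mrn": "B2", "site": "UM", "bmi": 30.5},
]
filtered = drop_features(cohort, ["mrn"])
```

Here "mrn" is dropped because it is on List1 and "site" is dropped because it takes a single value across all cases.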

List2 of DATE features is automated (as long as DATE or a date-associated label is present in the header of the feature). The obfuscation of DATE features is done by complying with the time resolution requested. Right now we obfuscate by leaving only the year.
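The date handling can be sketched in Python as follows. The helper names, the case-insensitive header match on "date", the ISO 8601 input format, and the optional "month" resolution are assumptions for illustration; the text above only states that headers are inspected and that the year is currently kept.

```python
from datetime import date

def truncate_date(iso_string, resolution="year"):
    """Reduce an ISO date string (YYYY-MM-DD) to the requested resolution."""
    d = date.fromisoformat(iso_string)
    if resolution == "year":
        return f"{d.year:04d}"
    if resolution == "month":
        return f"{d.year:04d}-{d.month:02d}"
    raise ValueError(f"unsupported resolution: {resolution}")

def obfuscate_dates(rows, resolution="year"):
    """Truncate every feature whose header contains 'date' (assumed rule)."""
    date_cols = {k for k in rows[0] if "date" in k.lower()}
    return [
        {k: (truncate_date(v, resolution) if k in date_cols and v is not None else v)
         for k, v in row.items()}
        for row in rows
    ]

records = [{"id": 1, "surgery_DATE": "2016-11-30", "bmi": 26.4}]
coarse = obfuscate_dates(records)
```

A plain substring match on the header would also catch unrelated names that happen to contain "date", so a real implementation would likely need a stricter rule.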

List3 of categorical features is now automated, enforcing the following criteria for defining a categorical variable:

if (number of unique values)/[3*log(number of cases)] > 1, then the feature is NOT CATEGORICAL.
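As a worked illustration of this rule, the following Python sketch applies the threshold to a single feature. The function name is hypothetical, and the natural logarithm is an assumption (the rule above does not state the base).

```python
import math

def is_categorical(values):
    """Apply the rule: unique / (3 * log(cases)) > 1 means NOT categorical.

    Written as a comparison rather than a ratio to avoid dividing by
    log(1) = 0 when there is a single case. Natural log is assumed.
    """
    n_cases = len(values)
    n_unique = len(set(values))
    return n_unique <= 3 * math.log(n_cases)

# With 100 cases the threshold is 3 * ln(100), roughly 13.8 unique values.
thirteen_levels = [i % 13 for i in range(100)]
fourteen_levels = [i % 14 for i in range(100)]
```

Under these assumptions, a feature with 13 distinct values over 100 cases is treated as categorical, while one with 14 distinct values is not.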

List4 of unstructured data is provided by the data governor but can be automated.

Similarity/Distance metric between cases

This step can calculate a variety of dissimilarity or distance metrics between cases/subjects. The function distance() in the ecodist package is written for extensibility and understandability, and it may not be efficient for use with large matrices. Initially, we select only numeric features. Later we may need to expand this to select factors/categorical features and strings/unstructured features. The default distance is the Bray-Curtis distance (more stable/fewer singularities). The result of distance() is a lower-triangular distance matrix, returned as an object of class “dist” with ([# of cases]*[# of cases - 1])/2 entries.
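To make the size of the result concrete, here is a small Python sketch of a pairwise Bray-Curtis computation. It does not call or reproduce ecodist; the function names are hypothetical, the inputs are assumed to be non-negative numeric vectors that are not all zero, and the column-wise ordering of the lower triangle is an assumption modeled on how R "dist" objects are stored.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

def pairwise_lower_triangle(cases):
    """Return the n*(n-1)/2 lower-triangle distances, column by column."""
    n = len(cases)
    return [
        bray_curtis(cases[i], cases[j])
        for j in range(n)
        for i in range(j + 1, n)
    ]

numeric_cases = [[1, 2, 3], [1, 2, 3], [2, 4, 6], [0, 0, 1]]
d = pairwise_lower_triangle(numeric_cases)
```

For these four cases the result holds 4*3/2 = 6 distances, which matches the size formula given above.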

Four Components of the Data Sifter

First component: k0 (not relevant for distance between cases)

The first component of the DataSifter is k0. It determines whether we want to obfuscate the unstructured features defined in List4. k0 is a binary option: [0,1].

Second and third components: cycles of missing values and imputation

The second and third components of the DataSifter are k1 and k2. They are linked, since the obfuscation step k1 is repeated as many times as specified in k2.

k1: the percentage of artificial missing values to introduce, followed by imputation. In this example we are using values from 0% to 40% in increments of 5%.

k2: how many times to repeat k1. In this example we are using 5 options: [0,1,2,3,4].
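The interplay of k1 and k2 can be sketched in Python as a loop that repeatedly blanks out a random fraction of each numeric feature and fills the gaps back in. This is only an illustration under strong assumptions: the function name is hypothetical, k1 is passed as a fraction rather than a percentage, and column-mean imputation is used purely as a stand-in for the multivariate imputation the DataSifter relies on.

```python
import random
from statistics import mean

def sift_numeric(columns, k1, k2, seed=0):
    """Run k2 cycles of: null out a fraction k1 of each column, then impute.

    `columns` maps feature name -> list of numbers. Mean imputation is a
    simplifying assumption, not the DataSifter's imputation method.
    """
    rng = random.Random(seed)
    data = {name: list(vals) for name, vals in columns.items()}
    for _ in range(k2):
        for name in list(data):
            vals = data[name]
            n_missing = int(round(k1 * len(vals)))
            missing = set(rng.sample(range(len(vals)), n_missing))
            observed = [v for i, v in enumerate(vals) if i not in missing]
            if not observed:
                continue  # nothing left to impute from; leave column as is
            fill = mean(observed)
            data[name] = [fill if i in missing else v for i, v in enumerate(vals)]
    return data

raw = {"bmi": [21.0, 24.5, 30.2, 27.8, 19.9, 33.1, 25.0, 22.4, 28.6, 26.3]}
sifted = sift_numeric(raw, k1=0.4, k2=1)
```

Setting k2 to 0 skips the loop and returns the data unchanged, which corresponds to the case where no imputation is necessary.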

Fourth component: obfuscation slider k3

The fourth component of the DataSifter is k3. It represents the fraction of features/cases to obfuscate on ALL the cases. In this example we are using values from 0% to 100% in increments of 5%.

## [1] "Current k0-k3 options selected for the Data Sifter Slider"

## [1] 0.2 0.0 1.0 0.5

## [1] "Current value of the Data Sifter Slider"

## [1] 0.4487761

## [1] "No imputation necessary, steps k1 and k2 skipped"

## [1] "Distance Sifted Data from Raw Data"

## [1] 4050291

##                                  [,1]        [,2]
## bmi                      2.646673e+01  1.09658973
## weight_lb               -5.100500e-01 -0.19884334
## glucose1                -4.767445e-02 -0.05191434
## hcreatinine              5.416449e-02 -0.04780429
## lhematocrit              1.290495e-01  0.10656088
## bloodglucose             7.987260e-02  0.03506021
## creatinine               1.069362e+00  0.94596397
## hct                     -2.419010e-02  0.03804396
## inr                      5.382232e+00  0.91458484
## platecount              -7.155486e-02 -0.06284009
## wbc                      2.103844e-01  0.18599646
## intraop_temp_celsius    -2.500595e+00 -0.44385678
## intraop_temperature     -1.663462e+00 -0.22103861
## intraop_temp_fahrenheit            NA -0.27568645
## postop_temp_celsius      4.947372e+00  7.31584732
## postop_temperature       2.413219e+00 -3.32989841
## postop_temp_fahrenheit             NA  1.68577459
## val_bmi                 -2.379922e+01  0.41415898
## val_nonsurgtime         -3.881009e-02 -0.03425127
## val_surgtime            -9.713865e-04  0.01759706
## postop_temp_c                      NA          NA
## postop_temp_f                      NA  2.77227772

DataSifter example: from zero to max data obfuscation

Comparing Raw and Sifted Data distributions

##      postop_temperature hcreatinine platecount glucose1
## 604                97.2        1.00        228      125
## 193                97.7        0.90        126      163
## 6041               36.5        0.99        214      165
## 1931               36.5        0.75        257      131

SOCR Electronic Medical Record DataSifter

Simeone Marino & Ivo Dinov

March 58 2017