
Background

We developed a novel method and protocol for on-the-fly de-identification of structured Clinical/Epic/PHI data. This approach provides complete administrative control over the risk of re-identification when sharing large clinical cohort-based medical data. At the extremes, null data may be shared or completely identifiable data may be provided (depending on access level, research needs, etc.). For pilot studies, the data office can dial up security (naturally devaluing the data in the process), whereas for promising pilot results, data governors may provide a more balanced dataset that trades off security against scientific value/impact. In a nutshell, our technique allows Health Systems to filter, export, package, and share virtually any clinical and medical data for a population cohort requested by a researcher interested in examining specific healthcare, biomedical, or translational characteristics of multivariate clinical data.

…..

## [1] "% of missing data in the Raw Data"
## [1] 23.66444
## This step eliminates any subject that has missing data (for the purpose of the pilot only)
## Real missing data issues will be dealt with at a later stage in the DataSifter, after categorical 
## and unstructured features are listed
pilot_data.complete <- na.omit(pilot_data[,-1]) # remove first index column and missing cases
#View(pilot_data.complete)
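For reference, the missing-data percentage printed above can be computed directly from the raw data frame. A minimal sketch (assuming pilot_data is the raw data frame loaded earlier; the original call is not shown in this excerpt):

# Share of NA cells over all cells of the raw data frame, as a percentage
print("% of missing data in the Raw Data")
print(100 * sum(is.na(pilot_data)) / prod(dim(pilot_data)))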

List of features to be handled

Ideally we want four lists of feature types as input:

1) LIST1: features to remove for privacy or other important reasons (this may/should be done at the source)
2) LIST2: dates
3) LIST3: categorical features (i.e., binomial/multinomial)
4) LIST4: unstructured features (e.g., medicine taken, with a dose specification and type of release)

LIST2 of DATE features can be automated (as long as DATE is in the name). For example, to check which columns are dates, we can use the following statement in the code:

sapply(pilot_data.complete, function(x) !all(is.na(as.Date(as.character(x), format = "%m/%d/%Y"))))

LIST3 of categorical features can be partially automated, but it won't be 100% accurate.

LIST4 of unstructured features can be inferred by exclusion from the other three lists.
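A minimal sketch of that exclusion logic (hypothetical; it assumes list1, list2, and the categorical lists have already been populated, as they are further below):

# Candidate unstructured features = everything not already claimed by lists 1-3
list3 <- c(list3a, list3b)
list4_candidates <- setdiff(names(pilot_data.complete), c(list1, list2, list3))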

# This is an example of a list of features to remove (point 1) above).
list1=c("ENCRYPTED_MRN","LAST_ADHD_MED_NAME", "FIRST_ADHD_MED_NAME", "ID", "GROUP",
        "wt_flag_ivo","ht_flag_ivo","bmi_flag_ivo","new_bmi_flag_ivo","HEIGHT_FLAG","WEIGHT_FLAG")

# This is where I remove the features in LIST1
pilot_data.cl <- pilot_data.complete[ , -which(names(pilot_data.complete) %in% list1)] 

# This is how we retrieve the features that represent dates. Note that the only check here is to look
# for the word DATE in the names; those features are then cast with as.Date, parsing the
# format = "%m/%d/%Y" and storing them in the standard "%Y-%m-%d" representation (e.g., 3/20/2014 --> 2014-03-20)
date_cols <- grepl("DATE", names(pilot_data.cl))
pilot_data.cl[, date_cols] <- lapply(pilot_data.cl[, date_cols], as.Date, format = "%m/%d/%Y")

# This step can be part of the Sifter, i.e., convert Date --> Year ONLY.
# The left-hand side of the assignment subsets the DATE columns identified above;
# the right-hand side uses lapply with year() (from the lubridate package) to extract the year.
# R replaces the original columns with the results in place.
pilot_data.cl[, date_cols] <- lapply(pilot_data.cl[, date_cols], year)
## This is the list of date features (point 2) above).
list2_cols <- which(grepl("DATE", names(pilot_data.cl)))
# and these are the feature/column names
list2 <- names(pilot_data.cl)[list2_cols]
## This is the list of categorical features (point 3) above).
list3a=c("GENDER") # binomial M/F
list3b=c("TONCIL_GE12","TONCIL_LT12","ADENO_TONCIL_GE12","ADENO_TONCIL_LT12")  # binomial Y/N
list3a_cols <- which(names(pilot_data.cl) %in% list3a)
list3b_cols <- which(names(pilot_data.cl) %in% list3b)

## This step recasts the binomials from letters to 0/1
pilot_data.cl[, list3a_cols] <- ifelse(pilot_data.cl[, list3a_cols] == "M", 1, 0)
pilot_data.cl[, list3b_cols] <- ifelse(pilot_data.cl[, list3b_cols] == "Y", 1, 0)

## This is the list of unstructured features (point 4) above)
list4=c("MED_NAME","DEP_MED_NAME")
# Note: this assumes the only remaining columns containing NAME are the list4 features
# (the other *_NAME columns were already removed via list1)
list4_cols=(which(grepl("NAME",names(pilot_data.cl))))

Similarity/Distance metric between cases

This step can calculate a variety of dissimilarity or distance metrics between cases/subjects. The function distance() in the ecodist package is written for extensibility and understandability, and it may not be efficient for use with large matrices. Initially we only select the numerical features; later this may need to be expanded to handle strings as well. We use the Bray-Curtis distance (more stable, fewer singularities), passing the method explicitly to distance(). The result of distance() is a lower-triangular distance matrix, an object of class "dist", with ([# of cases]*[# of cases - 1])/2 entries.
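As a quick sanity check of that size claim, consider a toy example (not from the pilot data):

library(ecodist)
toy <- data.frame(a = c(1, 2, 3, 4), b = c(4, 3, 2, 1))
d <- distance(toy, "bray-curtis")
length(d)   # (4 * 3) / 2 = 6 pairwise distances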

# pilot_data.md <- distance(pilot_data.cl[,1:31], "mahal") # Mahalanobis distance
# pilot_data.bc <- distance(pilot_data.cl[,1:31], "bray-curtis") # Bray-Curtis distance
# Plot the Mahalanobis and Bray-Curtis dissimilarities
# pilot_data.md <- distance(pilot_data.cl[sapply(pilot_data.cl, is.numeric)], "mahal") # Mahalanobis distance
# This step calculates the Bray-Curtis distance using only the numeric features
pilot_data.bc <- distance(pilot_data.cl[sapply(pilot_data.cl, is.numeric)], "bray-curtis")

# To retrieve which subjects are closest (i.e., most similar to each other),
# first we enumerate all possible pairs of subjects,
comb_subjects <- combn(dim(pilot_data.cl)[1], 2)
# then we generate a data frame pairing each pair ID with its distance,
df_bc_dist <- data.frame(Case = seq_len(ncol(comb_subjects)), Distance = as.vector(pilot_data.bc))
# and we order the cases by the BC metric,
df_bc_dist_temp <- df_bc_dist[order(df_bc_dist$Distance),]
# We then select a subset of cases to swap unstructured data on. The selection criterion
# could be a distance threshold, e.g., which(df_bc_dist_temp$Distance < 0.25),
# or just a percentage (e.g., the top 10%).
# As an example, I select the identical cases (i.e., distance == 0) and
# the similar cases (i.e., distance between 0 and 1)
case_identical <- which(df_bc_dist_temp$Distance == 0) 
case_similar <- which(df_bc_dist_temp$Distance > 0 & df_bc_dist_temp$Distance < 1)
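To recover the actual subject index pairs behind these cases, the pair IDs can be mapped back through comb_subjects (this mirrors how the swapping module below uses them):

# Each column gives the row indices of one pair of identical subjects
identical_pairs <- comb_subjects[, df_bc_dist_temp$Case[case_identical]]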

First component of the sifter slider: swapping module k0

The first component of the slider, k0, is the percentage of similar cases on which to swap the unstructured feature values from list4. I am thinking of 5 options: c(0,0.1,0.2,0.3,0.4).

## Here I set it to 20%
k0 <- 0.2
case_identical_subset <- sample(case_identical,round(length(case_identical)*k0))
case_similar_subset <- sample(case_similar,round(length(case_similar)*k0))

# Swapping module
# Each column of A holds the pair of subject (row) indices selected for a swap
A <- comb_subjects[, df_bc_dist_temp$Case[case_similar_subset]]
B <- pilot_data.cl
for (i in seq_along(case_similar_subset))
{
  # Exchange the unstructured (list4) feature values between the two paired subjects
  a1 <- B[A[1,i], list4_cols]
  a2 <- B[A[2,i], list4_cols]
  B[A[1,i], list4_cols] <- a2
  B[A[2,i], list4_cols] <- a1
}

## Dataset with the first filter k0 applied (swap of unstructured features between similar cases)
pilot_data.cl_k0 <- B
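A sanity check one could add here (not part of the original pilot code): the swap should leave every feature outside list4 untouched.

# All non-list4 columns must be identical before and after the k0 swap
stopifnot(identical(pilot_data.cl[, -list4_cols], pilot_data.cl_k0[, -list4_cols]))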

Second and third component of the sifter slider: cycles of missing values and imputation.

This next step loops through two more components of the DataSifter slider, namely: k1, the % of artificial missing values to introduce before imputation (I am thinking of 4 options: c(0,0.10,0.20,0.30)), and k2, how many times to repeat the k1 missingness + imputation cycle (I am thinking of 5 options: c(0,1,2,3,4)).

## Here I use k1=30% and k2=3
k1=0.3
k2=3
# Before performing the k1-k2 slider conditions, we exclude the unstructured features (list4);
# the dates (list2) could be excluded as well (see the commented line below)
list5_cols <- 1:length(pilot_data.cl_k0)
#list5_cols <- list5_cols[-c(list2_cols,list4_cols)]
list5_cols <- list5_cols[-c(list4_cols)]
C <- pilot_data.cl_k0[,list5_cols]
for (j in 1:k2)
{
  # Here I introduce a fraction k1 of artificial missing values (prodNA, missForest package)
  C <- prodNA(C, noNA = k1)
  # Here I impute the missing values with random forests (missForest package)
  C <- missForest(C, maxiter = 1)
  C <- C$ximp
}
##   missForest iteration 1 in progress...done!
##   missForest iteration 1 in progress...done!
##   missForest iteration 1 in progress...done!
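A quick check one could add here (not part of the original pilot code) to confirm that the final imputation cycle left no residual missingness:

# After the k2-th missForest pass there should be no NA values left in C
stopifnot(sum(is.na(C)) == 0)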
## Dataset with the second and third filter k1 and k2 applied (see above)
pilot_data.cl_k2 <- pilot_data.cl_k0
pilot_data.cl_k2[,list5_cols] <- C
pilot_data.cl_k2[,c(list2_cols,list4_cols)] <- pilot_data.cl_k0[,c(list2_cols,list4_cols)]
## This step rounds the features that are supposed to be integers (hard-coded to columns 7:14 for
## this pilot; a cleaner fix is to declare these features as integers when the data frame is generated)
pilot_data.cl_k2[,7:14] <- round(pilot_data.cl_k2[,7:14])

Comparing Raw and Sifted Data distributions

The validation step now compares the distributions of the raw dataset (in this example pilot_data.cl) with the distributions of the sifted dataset (in this example pilot_data.cl_k2) using violin/bean plots.

K <- list(c(0,0.1,0.2,0.3,0.4),100*c(0,0.10,0.20,0.30),c(0,1,2,3,4))
K_comb <- expand.grid(K)
names(K_comb) <- c("k0","k1","k2")
# The combined score is the sum of the three slider components
K_comb$Combined <- rowSums(K_comb)
K_comb_ordered <- K_comb[order(K_comb$Combined),]
K_comb_ordered$SliderValue <- sort(max(K_comb_ordered$Combined) - K_comb_ordered$Combined) /max(K_comb_ordered$Combined)
slider <- k0+100*k1+k2
sliderValue <- K_comb_ordered$SliderValue[which(K_comb_ordered$Combined == slider)]
print("In this example the slider is set to")
## [1] "In this example the slider is set to"
print(sliderValue)
## [1] 0.9651163
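As a check on the printed slider value: with the settings above, slider = 0.2 + 100*0.3 + 3 = 33.2, while the largest possible combined score is 0.4 + 30 + 4 = 34.4. Because the grid of combined scores is symmetric about its midpoint, the sorted remapping above reduces to SliderValue = Combined/max, and indeed 33.2/34.4 = 0.9651163, matching the printed output.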
Raw_Data <- cbind(pilot_data.cl,1)
Sifted_Data <- cbind(pilot_data.cl_k2,2)
names(Raw_Data)[length(names(Raw_Data))] <- names(Sifted_Data)[length(names(Sifted_Data))] <- c("Protocol")

# Keep only the quantitative features: drop dates (list2), categoricals (list3a/b), and unstructured (list4)
#Data=rbind(Raw_Data,Sifted_Data)
Data=rbind(Raw_Data[,-c(list2_cols,list3a_cols,list3b_cols,list4_cols)],
           Sifted_Data[,-c(list2_cols,list3a_cols,list3b_cols,list4_cols)])
for (j in 1:(length(Data)-1))
{
  # Side-by-side bean plots (beanplot package): raw on one side, sifted on the other
  beanplot(Data[,j] ~ Data$Protocol, ll = 0.04,
           main = "Data Sifter Pilot Study", side = "both", xlab="Features",
           col = list("purple", c("lightblue", "black")),
           axes=FALSE)
  axis(1, at=c(1),  labels=c(names(Data)[j]))
  axis(2)
  legend("bottomright", fill = c("purple", "lightblue"),
         legend = c("Raw Data", "Sifted Data"), box.lty=0)
  #readline("Press <return> to continue")
}
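Beyond the visual comparison, a simple numerical companion could be added (a hypothetical sketch, not part of the original pilot): a two-sample Kolmogorov-Smirnov test for each quantitative feature, flagging the features whose distributions were most perturbed by the sifting.

# Two-sample KS test per quantitative feature: raw (Protocol 1) vs. sifted (Protocol 2)
feat <- names(Data)[names(Data) != "Protocol"]
ks_pvals <- sapply(feat, function(f)
  ks.test(Data[Data$Protocol == 1, f], Data[Data$Protocol == 2, f])$p.value)
head(sort(ks_pvals))  # smallest p-values flag the most perturbed features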