We developed a novel method and a protocol for on-the-fly de-identification of structured Clinical/Epic/PHI data. This approach provides a complete administrative control over the risk for data identification when sharing large clinical cohort-based medical data. At the extremes, null data may be shared or a completely identifiable data can be provided (depending on access level, research needs, etc.) For pilot studies, the data office can dial up security (and naturally devaluing the data), whereas for promising pilot results, data governors may provide a more balanced dataset trading security and scientific value/impact. In a nutshell, out technique allows Health Systems to filler, export, package and share virtually any and all clinical and medical data for a population cohort that is requested by a researcher interested in examining specific healthcare, biomedical, or translational characteristics of multivariate clinical data.
…..
## [1] "% of missing data in the Raw Data"
## [1] 23.66444
## This step eliminates any subject that has missing data (for the purpose of the pilot only)
## Real missing data issues will be dealt with at a later stage in the DataSifter, after categorical
## and nonstructured features are listed
pilot_data.complete <- na.omit(pilot_data[,-1]) # remove first index column and missing cases
#View(pilot_data.complete)
Ideally we want to know 3 lists of feature types as an input 1) LIST1: List of features to remove for privacy or other important reasons (this may/should be done at the source) 2) LIST2: List of dates 3) LIST3: List of categorical features (i.e., binomial/multinomial) 4) LIST4: List of unstructured features (e.g., medicine taken with a dose spec and typoe of release) 5) LIST5: List of numerical features (both integer and double, on which to calculate distances and entropy)
List2 of DATE features can be automated (as long as DATE is in the name). For example, to check which columns are dates, we can use the following statement in the code: sapply(pilot_data.complete, function(x) !all(is.na(as.Date(as.character(x),format=“%d/%m/%Y”))))
List3 of categorical features is now automated (still more tests are recommended).
List4 of unstructured data can be automated (not implemented yet, list provided as input).
# This is an example of a list of features to remove (point 1) above).
list1=c("ENCRYPTED_MRN","LAST_ADHD_MED_NAME", "FIRST_ADHD_MED_NAME", "ID", "GROUP",
"wt_flag_ivo","ht_flag_ivo","bmi_flag_ivo","new_bmi_flag_ivo","HEIGHT_FLAG",
"WEIGHT_FLAG","BMI_FNL","BMI","bmiold")
# Since BMI = Weight/[Height^2], weight in Kg and Height in meter, we want to list any BMI feature in LIST1 ,
# This is where I remove the features in LIST1
pilot_data.cl <- pilot_data.complete[ , -which(names(pilot_data.complete) %in% list1)]
# This is how we retrieve a list of features that represents dates. Note that the only check here is to look
# for the word DATE in the names and then make the format of the features as.Date, matching the
# format = "%m/%d/%Y" and recasting it as "%Y-%M-%D" according to as.Date (e.g., 3/20/2014 --> 2014-03-20)
pilot_data.cl [, cols <- grepl("DATE", names(pilot_data.cl))] <-
lapply(pilot_data.cl[, cols <- grepl("DATE",names(pilot_data.cl))], as.Date, format = "%m/%d/%Y")
# This step can be part of the Sifter, i.e. convert Date --> to Year ONLY
# For this type conversion (coercion), the left hand side of the assign (pilot_data.cl[sapply(.)])
# subsets the columns that are either not-numeric or not-logical.
# The right hand side, for that subset, uses lapply to perform the conversion you need to do.
# R is smart enough to replace the original columns with the results.
pilot_data.cl [, cols <- grepl("DATE", names(pilot_data.cl))] <-
lapply(pilot_data.cl[, cols <- grepl("DATE",names(pilot_data.cl))], year)
## This is the list of date features (point 2) above).
list2_cols=which(grepl("DATE",names(pilot_data.cl)))
# and these are the feature/column names
list2=names(pilot_data.cl)[which(grepl("DATE",names(pilot_data.cl)))]
# Set the DATE features to integers (still numeric)
pilot_data.cl[,list2_cols] <- lapply(pilot_data.cl[,list2_cols],as.integer)
## This is the list of unstructured features
list4=c("MED_NAME","DEP_MED_NAME")
list4_cols=(which(grepl("NAME",names(pilot_data.cl))))
# This converts all the unstructured data to string
pilot_data.cl[,list4_cols] <- lapply(pilot_data.cl[,list4_cols],toString)
## This is the list of categorical features (point 3) above).
# This is where we automatically select factors (categorical features)
# By default, list1 is already discarded, list2 (DATES) is integer and list 4 is string.
# So, the is.factor() function will not select any feature from list1, list2 and list4,
# by construction.
list3<-which(sapply(pilot_data.cl,is.factor)==TRUE)
list3_names <- names(list3)
list3_cols <- which(names(pilot_data.cl) %in% list3_names)
# This module will assign levels automatically (as integers) according each categorical
# feature levels.For example, in feature 3 has 4 levels, it will be recasted as 1,2,3 and 4.
# We can add a max # of levels above which the feature will be considered not categorical.
category_new <- matrix(nrow = dim(pilot_data.cl)[1], ncol = dim(pilot_data.cl)[2])
for (j in list3_cols)
{
##
# This is where We can add a check for max # of levels above which the feature will be
# considered not categorical.
# if length(levels(pilot_data.cl[,j]))
##
category_temp <- NULL
for (i in 1:length(levels(pilot_data.cl[,j])))
{
rows <- which(pilot_data.cl[,j] == levels(pilot_data.cl[,j])[i])
category_temp[rows] <-i
}
category_new[,j] <- category_temp
#pilot_data.cl_[cols,j] <- category_temp[cols]
}
pilot_data.cl[,list3_cols] <- category_new[,list3_cols]
pilot_data.cl[,list3_cols] <- lapply(pilot_data.cl[,list3_cols],as.factor)
# This is a check for max # of levels above which the feature will be
# considered not categorical.
list3_NotCateg_cols <- NULL
for (j in list3_cols)
{
if(length(unique(levels(pilot_data.cl[,j])))>floor((3*log10(dim(pilot_data.cl)[1]))))
{
list3_NotCateg_cols <- c(list3_NotCateg_cols,j)
}
}
# Updated list3 (without the fatures classified above as non categorical)
# It checks if the non categorical is empty. If not, it updates list3
ifelse(is.null(list3_NotCateg_cols),"No categorical features above threshold",
list3_cols <- list3_cols[-c(which(list3_cols %in% list3_NotCateg_cols))])
## [1] "No categorical features above threshold"
# This step defines the non categorical as numeric
pilot_data.cl[,list3_NotCateg_cols] <- as.numeric(pilot_data.cl[,list3_NotCateg_cols])
list3_NotCateg_names <- names(list3_NotCateg_cols)
# This step updates list3 and list3 names
list3<-which(sapply(pilot_data.cl,is.factor)==TRUE)
list3_names <- names(list3)
list3_cols <- which(names(pilot_data.cl) %in% list3_names)
This step can calculate a variety of dissimilarity or distance metrics between cases/subjects. The function distance() in the ecodist package is written for extensibility and understandability, and it may not be efficient for use with large matrices. Initially, only select the numeric features (which includes integer such as dates, or total count of medicines). Later we may need to expand this to select strings. The default distance is the Gray-Curtis distance (more stable/fewer singularities). The results of distance() is a lower-triangular distance matrix as an object of class “dist” with size ([# of cases]*[# of cases - 1])/2.
# pilot_data.md <- distance(pilot_data.cl[,1:31], "mahal") # Mahalanobis distance
# pilot_data.bc <- distance(pilot_data.cl[,1:31]) # default: Gray-Curt distance
# Plot the Mahalanobis and Bray-Curtis dissimilarities
#pilot_data.md <- distance(pilot_data.cl[sapply(pilot_data.cl, is.numeric)], "mahal") # Mahalanobis distance
# This step calculates the distance only for numeric features
pilot_data.bc <- distance(pilot_data.cl[sapply(pilot_data.cl, is.numeric)])
# To retrieve which subjects are closest (i.e., more similar to each other),
# first we determine the list of possible combinations,
comb_subjects=combn(dim(pilot_data.cl)[1],2)
# then we generate a data frame with the ID of the distances,
df_bc_dist <- as.data.frame(cbind(seq(1,length(comb_subjects)/2,1),pilot_data.bc))
names(df_bc_dist) <- c("Case","Distance")
# and we order the BC metrics,
df_bc_dist_temp <- df_bc_dist[order(df_bc_dist$Distance),]
# We then select a subset to swap unstructured data on, based on the criteria below.
# Below we can substitute which(df_bc_dist_temp$Distance < 0.25)
# Or just take a % (e.g., top 10%).
# As an example I show the identical cases (i.e., distance == 0) and
# similar cases (i.e., distance between 0 and 1)
case_identical <- which(df_bc_dist_temp$Distance == 0)
case_similar <- which(df_bc_dist_temp$Distance > 0 & df_bc_dist_temp$Distance < 1)
First component of the slider - k0: % of similar cases on which to swap unstructured feature values from list4. I am thinking of 5 options: c(0,0.1,0.2,0.3,0.4).
## Here I set all the components of the Sifter Slider
slider_weights <- c(9.97,1.109,0.114,100.989)
k0_list <- slider_weights[1]*c(0,0.1,0.2,0.3,0.4)
k1_list <- slider_weights[2]*c(0,0.1,0.2,0.3,0.4)
k2_list <- slider_weights[3]*c(0,1,2,3,4)
k3_list <- slider_weights[4]*c(0,0.1,0.2,0.3,0.4)
k0 <- 0.4 # k0_list <- c(0,0.1,0.2,0.3,0.4)
k1 <- 0.4 # k1_list <- c(0,0.1,0.2,0.3,0.4)
k2 <- 4 # k2_list <- c(0,1,2,3,4)
k3 <- 0.4 # k3_list <- c(0,0.1,0.2,0.3,0.4)
K <- list(k0_list,k1_list,k2_list,k3_list)
K_comb <- expand.grid(K)
for (i in 1:dim(K_comb)[1])
{
K_comb[i,4] <- sum(K_comb[i,1:4])
}
names(K_comb) <- c("k0","k1","k2","Combined")
K_comb_ordered <- K_comb[order(K_comb$Combined),]
length(K_comb_ordered$Combined)
## [1] 625
length(unique(K_comb_ordered$Combined))
## [1] 625
K_comb_ordered$SliderValue <- sort(max(K_comb_ordered$Combined) - K_comb_ordered$Combined) /max(K_comb_ordered$Combined)
k_list <- c(k0,k1,k2,k3)
slider <- sum(slider_weights * k_list)
sliderValue <- K_comb_ordered$SliderValue[which(K_comb_ordered$Combined == slider)]
print("In this example the slider is set to")
## [1] "In this example the slider is set to"
print(sliderValue)
## [1] 1
if (k0 == 0)
{
pilot_data.cl_k0 <- pilot_data.cl
} else {
case_identical_subset <- sample(case_identical,round(length(case_identical)*k0))
case_similar_subset <- sample(case_similar,round(length(case_similar)*k0))
# Swapping module
A <- comb_subjects[,df_bc_dist_temp$Case[case_similar_subset]]
B<-pilot_data.cl
#B[,list4_cols]
for (i in 1:length(case_similar_subset))
{
a1 <- B[A[,i][1],list4_cols]
a2 <- B[A[,i][2],list4_cols]
B[A[,i][1],list4_cols] <- a2
B[A[,i][2],list4_cols] <- a1
#readline("Press <return to continue")
}
## Dataset with the first filter k0 applied (swap of unstructured features between similar cases)
pilot_data.cl_k0 <- B
}
This next step loops through two conditions of the DataSifter slider, namely: k1: introduction of % of artificial missing values + imputation. I am thinking of 5 options: c(0,0.10,0.20,0.30,0.40) k2: how many time to repeat k1. I am thinking of 5 options: c(0,1,2,3,4)
# Before performing the k1-k2 slider conditions, we neglect the unstructured features (list4)
# and the dates (list2)
list5_cols <- 1:length(pilot_data.cl_k0)
#list5_cols <- list5_cols[-c(list2_cols,list4_cols)]
list5_cols <- list5_cols[-c(list4_cols)]
C <- pilot_data.cl_k0[,list5_cols]
for (j in 1:k2)
{
# Here I introduce k1 missing data
C <- prodNA(C, noNA = k1)
# Here I impute with the function missForest
C <- missForest(C, maxiter = 1)
C <- C$ximp
}
## missForest iteration 1 in progress...done!
## missForest iteration 1 in progress...done!
## missForest iteration 1 in progress...done!
## missForest iteration 1 in progress...done!
## Dataset with the second and third filter k1 and k2 applied (see above)
pilot_data.cl_k2 <- pilot_data.cl_k0
pilot_data.cl_k2[,list5_cols] <- C
pilot_data.cl_k2[,c(list2_cols,list4_cols)] <- as.data.frame(pilot_data.cl_k0[,c(list2_cols,list4_cols)])
# Recast raw and k2 sifted data with the same integer features
a <- which(lapply(pilot_data.cl,is.integer)==TRUE)
for (i in 1:length(a))
{
pilot_data.cl_k2[,a[i]] <- as.integer(pilot_data.cl_k2[,a[i]])
}
#cbind(pilot_data.cl[1:20,which(lapply(pilot_data.cl.obfusc,is.double) == TRUE)],pilot_data.cl.obfusc[1:20,which(lapply(pilot_data.cl.obfusc,is.double) == TRUE)])
## This step rounds features that are supposed to be integers (I need to change it from the time
## I generate the data frame and set these features to integers)
#pilot_data.cl_k2[,7:14] <- round(pilot_data.cl_k2[,7:14])
list5<-which(sapply(pilot_data.cl_k2,is.numeric)==TRUE)
# Here I generate the empirical distribution for each of the numeric features
# If the unique values of an integer feature are few but > 3*log10(# of cases),
# It is recommended to simply randomly pick a value among the unique values (not implemented yet),
# otherwise the swapping is uneffective.
empirical = sapply(pilot_data.cl_k2[,list5],ecdf)
pilot_data.cl.ecdf <- pilot_data.cl_k2
pilot_data.cl.ecdf_sorted <- pilot_data.cl_k2
# Here I go through the list of numeric features and generate the empirical distribution
# The data frame pilot_data.cl.ecdf has the empirical distribution of all the numeric features
for (i in 1:length(list5))
{
pilot_data.cl.ecdf_sorted[,list5[i]] <- sort(pilot_data.cl_k2[,list5[i]])
eval(parse(text=paste0("f = empirical$",names(list5)[i],collapse = '',sep = '',"(pilot_data.cl.ecdf_sorted[,list5[i]])")))
pilot_data.cl.ecdf[,list5[i]] <- f
}
# Here I sample feature to obfuscate - k3. I am thinking of 6 options
if(k3 == 0.0) # this checks if no obfuscation is required
{
pilot_data.cl.obfusc <- pilot_data.cl_k2
list_cases_obfusc <- dim(pilot_data.cl_k2)[1]
list_feature_obfusc <- dim(pilot_data.cl_k2)[2]
} else {
list_feature_obfusc <- sample(list5,round(k3*length(list5)))
list_cases_obfusc <- sample(1:dim(pilot_data.cl_k2)[1],round(k3*dim(pilot_data.cl_k2)[1]))
# This will be the new set of data after obfuscation
pilot_data.cl.obfusc <- pilot_data.cl_k2
for (j in 1:length(list_cases_obfusc))
{
for (i in 1:length(list_feature_obfusc))
{
s=runif(1)
# print("Random value for ecdf")
# print(s)
# This is where is select the first time [t] the feature in the ecdf is greater than the sample s
t = which(pilot_data.cl.ecdf[,list_feature_obfusc[i]] > s)[1]
# print("index of t")
# print(t)
# This is where I replace/obfuscate.
# First I set a equal to the feature value correspondent to t.
a <- pilot_data.cl.ecdf_sorted[t,list_feature_obfusc[i]]
# Then I need to know if the feature is a double or an integer.
# In the process, I do a nested ifelse. The inner ifelse checks if t is equal to 1 (i.e., if it is
# the first element in the ecdf. If yes, we can only pick the first element, both in the case of
# integer or double). The outer ifelse, checks if the feature is a double.
# If the feature is a double, and t is not 1, then a random uniform value between the values of the feature
# at t and at t-1 is sampled.
# If the feature is an integer, a Bernoulli sample is done between the values of the feature
# at t and at t-1.
#print(pilot_data.cl[list_cases_obfusc[j],list_feature_obfusc[i]])
ifelse(is.double(pilot_data.cl_k2[,list_feature_obfusc[i]]),
ifelse(t == 1, c <- runif(1, min = a, max = a),
c <- runif(1,min = pilot_data.cl.ecdf_sorted[t-1,list_feature_obfusc[i]],max = a)),
ifelse(t == 1, c <- runif(1, min = a, max = a),
# This add or subtract 1 to the t-1 element
#c <- pilot_data.cl.ecdf_sorted[t-1,list_feature_obfusc[i]] + sample(c(-1,0,1),1,0.33)))
# This sample a value between t-1 an t element
c <- sample(c(pilot_data.cl.ecdf_sorted[t-1,list_feature_obfusc[i]],a),1,0.5)))
pilot_data.cl.obfusc[list_cases_obfusc[j],list_feature_obfusc[i]] <- c
}
}
}
diff1 <- NULL
diff1 <- sum(abs(pilot_data.cl[list_cases_obfusc,list_feature_obfusc] - pilot_data.cl.obfusc[list_cases_obfusc,list_feature_obfusc]))
print("Distance Sifted Data from Raw Data")
## [1] "Distance Sifted Data from Raw Data"
print(diff1)
## [1] 6526
Now the validation step will compare the distributions of the raw dataset (in this example pilot_data.cl) with the distributions of the sifted dataset (in this example pilot_data.cl_k2) by using violin/bean plots.
Raw_Data <- cbind(pilot_data.cl,1)
Sifted_Data <- cbind(pilot_data.cl.obfusc,2)
#Sifted_Data <- cbind(pilot_data.cl_k2,2)
names(Raw_Data)[length(names(Raw_Data))] <- names(Sifted_Data)[length(names(Sifted_Data))] <- c("Protocol")
#Data=rbind(Raw_Data,Sifted_Data)
# Violin plots without dates
# Data=rbind(Raw_Data[,-c(list2_cols,list3,list4_cols)],Sifted_Data[,-c(list2_cols,list3_cols,list4_cols)])
# Violin plots with dates
Data=rbind(Raw_Data[,-c(list3,list4_cols)],Sifted_Data[,-c(list3_cols,list4_cols)])
for (j in 1:(length(Data)-1))
{
beanplot(Data[,j] ~ Data$Protocol, ll = 0.04,
main = "Data Sifter Pilot Study", side = "both", xlab="Features",
col = list("purple", c("lightblue", "black")),
axes=F)
axis(1, at=c(1), labels=c(names(Data)[j]))
axis(2)
legend("bottomright", fill = c("purple", "lightblue"),
legend = c("Raw Data", "Sifted Data"), box.lty=0)
#readline("Press <return to continue")
}
glm.raw <- glm(pilot_data.cl$HEIGHT ~ pilot_data.cl$WEIGHT + pilot_data.cl$FIRST_OSA_DATE + pilot_data.cl$CURR_AGE + pilot_data.cl$BIRTH_DATE + pilot_data.cl$FIRST_ADHD_DATE + pilot_data.cl$FIRST_OSA_DATE + pilot_data.cl$LAST_ADHD_MED_DATE + pilot_data.cl$DEP_MED_DATE + pilot_data.cl$FIRST_ADHD_MED_DATE +pilot_data.cl$CURR_AGE + pilot_data.cl$FIRST_ADHD_AGE + pilot_data.cl$FIRST_OSA_AGE + pilot_data.cl$ENCOUNTER_DATE + pilot_data.cl$VITALS_DATE + pilot_data.cl$MED_DATE + pilot_data.cl$count_total_meds)
glm.obfusc <- glm(pilot_data.cl.obfusc$HEIGHT ~ pilot_data.cl.obfusc$WEIGHT + pilot_data.cl.obfusc$FIRST_OSA_DATE + pilot_data.cl.obfusc$CURR_AGE + pilot_data.cl.obfusc$BIRTH_DATE + pilot_data.cl.obfusc$FIRST_ADHD_DATE + pilot_data.cl.obfusc$FIRST_OSA_DATE + pilot_data.cl.obfusc$LAST_ADHD_MED_DATE + pilot_data.cl.obfusc$DEP_MED_DATE + pilot_data.cl.obfusc$FIRST_ADHD_MED_DATE +pilot_data.cl.obfusc$CURR_AGE + pilot_data.cl.obfusc$FIRST_ADHD_AGE + pilot_data.cl.obfusc$FIRST_OSA_AGE + pilot_data.cl.obfusc$ENCOUNTER_DATE + pilot_data.cl.obfusc$VITALS_DATE + pilot_data.cl.obfusc$MED_DATE + pilot_data.cl.obfusc$count_total_meds)
glm.comparison <- cbind(glm.raw$coefficients,glm.obfusc$coefficients)
print(glm.comparison)
## [,1] [,2]
## (Intercept) -9844.0484599 -709.04076766
## pilot_data.cl$WEIGHT 0.2455919 0.41039594
## pilot_data.cl$FIRST_OSA_DATE 0.8793959 0.12534197
## pilot_data.cl$CURR_AGE 5.0103448 0.05447324
## pilot_data.cl$BIRTH_DATE -1.8360876 -1.97800440
## pilot_data.cl$FIRST_ADHD_DATE 3.0731381 0.21476593
## pilot_data.cl$LAST_ADHD_MED_DATE 0.5151427 0.45850767
## pilot_data.cl$DEP_MED_DATE -1.9281422 -0.80040521
## pilot_data.cl$FIRST_ADHD_MED_DATE 1.2258863 0.88419671
## pilot_data.cl$FIRST_ADHD_AGE -2.2085329 0.24659916
## pilot_data.cl$FIRST_OSA_AGE -1.2400902 0.06759153
## pilot_data.cl$ENCOUNTER_DATE 0.2661212 2.05264858
## pilot_data.cl$VITALS_DATE 4.1031610 -0.41378455
## pilot_data.cl$MED_DATE -1.3652251 -0.14174429
## pilot_data.cl$count_total_meds 0.6779039 0.47215164