# Load the libraries used throughout this document
library(dplyr)       # data manipulation and the %>% pipe
library(ggplot2)     # plotting
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

df1 = read.csv("C:\\Users\\luzya\\OneDrive\\Escritorio\\PhD program\\Fall 2024\\DS 8998 - Farhana\\df_ex2.csv")
df2 = read.csv("C:\\Users\\luzya\\OneDrive\\Escritorio\\PhD program\\Fall 2024\\DS 8998 - Farhana\\df_ex1.csv")
head(df2, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| id | Zip_Code | sex | race | age | education | hourwage |
|---|---|---|---|---|---|---|
| ID0008 | 23221 | Female | White | 30 | Associate’s degree, occupational/vocational program | 11.49 |
| ID0009 | 23460 | Male | White | 33 | Associate’s degree, occupational/vocational program | 14.50 |
| ID0077 | 22153 | Male | White | 60 | High school diploma or equivalent | 9.00 |
| ID0080 | 22303 | Female | White | 41 | High school diploma or equivalent | 7.50 |
| ID0081 | 22181 | Female | White | 17 | Grade 11 | 7.25 |
| ID0083 | 24015 | Female | White | 31 | Associate’s degree, academic program | 10.42 |
| ID0084 | 23832 | Male | White | 31 | High school diploma or equivalent | 13.00 |
| ID0088 | 23603 | Male | White | 53 | Grade 11 | 12.00 |
| ID0089 | 23323 | Male | White | 45 | Some college but no degree | 28.75 |
| ID0090 | 22303 | Female | White | 38 | Professional school degree | 13.00 |
# Identifying distinct entries for the variables
distinct_entries <- df2 %>%
  summarise(
    age = n_distinct(age),
    education = n_distinct(education),
    sex = n_distinct(sex),
    race = n_distinct(race),
    Zip_Code = n_distinct(Zip_Code)
  )

distinct_entries
##   age education sex race Zip_Code
## 1  67        16   2    2      226

Exploratory Data Analysis

ggplot(df2,aes(x=hourwage)) +
  geom_histogram(color="Black",fill="skyblue")+
  labs(title = 'Histogram of Hourly Wage', x = 'Hourly Wage', y = 'Frequency')+
  theme(plot.title = element_text(hjust=0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

race_counts <- table(df2$race)

# Calculate percentages
race_percentages <- prop.table(race_counts) * 100

# Combine counts and percentages into a data frame for better readability
race_table <- data.frame(
  Count = race_counts,
  Percentage = sprintf("%.2f%%", race_percentages)  # Format as percentage
)

print("Table for Race Distribution:")
## [1] "Table for Race Distribution:"
print(race_table)
##   Count.Var1 Count.Freq Percentage
## 1      Black      26113     12.41%
## 2      White     184338     87.59%
library(table1)
table1::label(df2$age) <- "Age"
table1::label(df2$hourwage) <- "Hourly Wage"
table1::label(df2$education) <- "Education Level"
table1::label(df2$sex) <- "Sex"

table1::table1(~ age + hourwage + sex + education | race, data = df2)
|  | Black (N=26113) | White (N=184338) | Overall (N=210451) |
|---|---|---|---|
| Age |  |  |  |
| Mean (SD) | 40.8 (14.7) | 40.8 (15.4) | 40.8 (15.3) |
| Median [Min, Max] | 40.0 [15.0, 85.0] | 40.0 [15.0, 85.0] | 40.0 [15.0, 85.0] |
| Hourly Wage |  |  |  |
| Mean (SD) | 17.3 (9.33) | 19.6 (11.0) | 19.3 (10.8) |
| Median [Min, Max] | 15.0 [1.00, 100] | 16.8 [0.0100, 100] | 16.5 [0.0100, 100] |
| Sex |  |  |  |
| Female | 14696 (56.3%) | 92721 (50.3%) | 107417 (51.0%) |
| Male | 11417 (43.7%) | 91617 (49.7%) | 103034 (49.0%) |
| Education Level |  |  |  |
| 12th grade, no diploma | 515 (2.0%) | 2716 (1.5%) | 3231 (1.5%) |
| Associate's degree, academic program | 1784 (6.8%) | 13227 (7.2%) | 15011 (7.1%) |
| Associate's degree, occupational/vocational program | 1244 (4.8%) | 10943 (5.9%) | 12187 (5.8%) |
| Bachelor's degree | 3296 (12.6%) | 30087 (16.3%) | 33383 (15.9%) |
| Doctorate degree | 130 (0.5%) | 988 (0.5%) | 1118 (0.5%) |
| Grade 10 | 439 (1.7%) | 3959 (2.1%) | 4398 (2.1%) |
| Grade 11 | 870 (3.3%) | 5637 (3.1%) | 6507 (3.1%) |
| Grade 9 | 240 (0.9%) | 2986 (1.6%) | 3226 (1.5%) |
| Grades 1, 2, 3, or 4 | 67 (0.3%) | 955 (0.5%) | 1022 (0.5%) |
| Grades 5 or 6 | 93 (0.4%) | 2192 (1.2%) | 2285 (1.1%) |
| Grades 7 or 8 | 163 (0.6%) | 2172 (1.2%) | 2335 (1.1%) |
| High school diploma or equivalent | 9985 (38.2%) | 62735 (34.0%) | 72720 (34.6%) |
| Master's degree | 1015 (3.9%) | 7264 (3.9%) | 8279 (3.9%) |
| None or preschool | 47 (0.2%) | 371 (0.2%) | 418 (0.2%) |
| Professional school degree | 96 (0.4%) | 775 (0.4%) | 871 (0.4%) |
| Some college but no degree | 6129 (23.5%) | 37331 (20.3%) | 43460 (20.7%) |

Defining and Measuring Privacy

k-anonymity

K-anonymity ensures that each individual in a dataset cannot be distinguished from at least k-1 other individuals with respect to the quasi-identifiers in the dataset. This is done through grouping and suppression. Applying k-anonymity makes it more difficult for an attacker to re-identify specific individuals in the dataset.

Vocabulary

  • Identifiers (also known as key attributes): direct identifiers such as names, student numbers, email addresses, etc. These variables should in principle not be collected at all, or removed from the dataset if they are not necessary for your research project.

  • Quasi-identifiers: indirect identifiers that can lead to identification when combined with other quasi-identifiers in the dataset or external information. These are often demographic variables like age, sex, place of residence, etc., but could also be something entirely different like physical characteristics, timestamps, etc. In general, quasi-identifiers are usually variables that are likely to be known to someone in the outside world.

  • Sensitive attributes: variables of interest which should be protected, and which cannot be changed, because they are the main outcome variables. For example, it can be Medical condition in a healthcare dataset, or Income in a financial dataset.

It is important to correctly categorise the variables in your dataset as any of these variable types if you want to apply k-anonymity because they will determine how the dataset will be de-identified. To make a dataset k-anonymous, you must first identify which variables in the dataset are identifiers, quasi-identifiers, and sensitive attributes.

In the example above, we have that

  • Identifiers: id
  • Quasi-identifiers: zip_code, age, sex, race, and education
  • Sensitive attributes: hourwage

Next, you should set a value for k. If we choose a k of 2, every row in the dataset should have the same combination of quasi-identifiers as at least 1 other row in the dataset. Finally, you aggregate the dataset so that every combination of quasi-identifiers occurs at least k times.
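
As a quick check, k-anonymity can be verified programmatically by counting the size of every equivalence class and confirming that the smallest class contains at least \(k\) records. A minimal sketch (the helper is_k_anonymous below is our own, not from a package):

# Returns TRUE if every equivalence class defined by the quasi-identifiers
# contains at least k records
is_k_anonymous <- function(data, quasi_ids, k) {
  class_sizes <- data %>%
    count(across(all_of(quasi_ids)))  # one row (with size n) per equivalence class
  min(class_sizes$n) >= k
}

# Example: check the raw data against k = 2
is_k_anonymous(df2, c("Zip_Code", "age", "sex", "race", "education"), k = 2)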

Suppressing, Grouping, and Perturbing

Step-by-Step Implementation

Suppressing

# Applying suppression by removing unique identifier
df2$id <- NULL

head(df2, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Zip_Code | sex | race | age | education | hourwage |
|---|---|---|---|---|---|
| 23221 | Female | White | 30 | Associate’s degree, occupational/vocational program | 11.49 |
| 23460 | Male | White | 33 | Associate’s degree, occupational/vocational program | 14.50 |
| 22153 | Male | White | 60 | High school diploma or equivalent | 9.00 |
| 22303 | Female | White | 41 | High school diploma or equivalent | 7.50 |
| 22181 | Female | White | 17 | Grade 11 | 7.25 |
| 24015 | Female | White | 31 | Associate’s degree, academic program | 10.42 |
| 23832 | Male | White | 31 | High school diploma or equivalent | 13.00 |
| 23603 | Male | White | 53 | Grade 11 | 12.00 |
| 23323 | Male | White | 45 | Some college but no degree | 28.75 |
| 22303 | Female | White | 38 | Professional school degree | 13.00 |

Grouping

df3 = df1

df3$age_group <- cut(
  df1$age,
  breaks = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
  right  = FALSE,
  labels = c("10s", "20s", "30s", "40s", "50s", "60s", "70s", "80s")
)

df3$age <- NULL

df3$Zip_Code <- NULL

df3$id <- NULL

# Define the desired column order, with County first and hourwage last
new_order <- c("County", "sex", "race", "age_group", "education", "hourwage")

# Reorder the dataframe columns according to the new order
df3 <- df3[, new_order]

# Apply this method only after modelling (otherwise the task turns into classification)
#max_wage <- ceiling(max(df1$hourwage) / 5) * 5  # This rounds up to the nearest 5
#df1$wage_group <- cut(df1$hourwage, breaks=seq(0, max_wage, by=5), include.lowest=TRUE)
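# Note: the table below previews df4, but the step that creates it is not
# shown in this document. A minimal sketch of how df4 could be derived from
# df3, assuming the four education groups that appear in the output tables:
df4 <- df3
df4$education_group <- case_when(
  df4$education == "High school diploma or equivalent" ~ "High school diploma or equivalent",
  df4$education == "Some college but no degree" ~ "Some college but no degree",
  grepl("Associate|Bachelor|Master|Doctorate|Professional", df4$education) ~ "Associate Degree or more",
  TRUE ~ "Less than High School diploma"
)
df4$education <- NULL
df4 <- df4[, c("County", "sex", "race", "age_group", "education_group", "hourwage")]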
head(df4, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| County | sex | race | age_group | education_group | hourwage |
|---|---|---|---|---|---|
| Richmond | Female | White | 30s | Associate Degree or more | 11.49 |
| Virginia Beach | Male | White | 30s | Associate Degree or more | 14.50 |
| Springfield | Male | White | 60s | High school diploma or equivalent | 9.00 |
| Alexandria | Female | White | 40s | High school diploma or equivalent | 7.50 |
| Vienna | Female | White | 10s | Less than High School diploma | 7.25 |
| Roanoke | Female | White | 30s | Associate Degree or more | 10.42 |
| Chesterfield | Male | White | 30s | High school diploma or equivalent | 13.00 |
| Newport News | Male | White | 50s | Less than High School diploma | 12.00 |
| Chesapeake | Male | White | 40s | Some college but no degree | 28.75 |
| Alexandria | Female | White | 30s | Associate Degree or more | 13.00 |

Perturbing

Perturbing is a privacy-preserving technique that protects data by adding small amounts of random noise or modifying values in a controlled way. The goal of perturbing is to obscure individual data points, making it harder to re-identify individuals while retaining the dataset’s overall statistical properties.

In our dataset, this could involve adding or subtracting a small amount to the age variable instead of grouping. Since we have already performed generalization through grouping, we will not add perturbation to this variable.
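
Although we skip it here, a minimal sketch of what such perturbation could look like on a copy of the data (the ±2-year noise range is an arbitrary choice for illustration):

# Illustrative only: add small uniform integer noise to age on a copy of the data
set.seed(42)  # make the random noise reproducible
df_perturbed <- df2
df_perturbed$age <- df_perturbed$age + sample(-2:2, nrow(df_perturbed), replace = TRUE)
# Keep perturbed ages within the observed range of the data (15 to 85)
df_perturbed$age <- pmin(pmax(df_perturbed$age, 15), 85)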

table1::label(df4$age_group) <- "Age Group"
table1::label(df4$hourwage) <- "Hourly Wage"
table1::label(df4$education_group) <- "Education Group"
table1::label(df4$sex) <- "Sex"

table1::table1(~ age_group + hourwage + sex + education_group | race, data = df4)
|  | Black (N=26113) | White (N=184338) | Overall (N=210451) |
|---|---|---|---|
| Age Group |  |  |  |
| 10s | 1315 (5.0%) | 11807 (6.4%) | 13122 (6.2%) |
| 20s | 6001 (23.0%) | 43024 (23.3%) | 49025 (23.3%) |
| 30s | 5529 (21.2%) | 37103 (20.1%) | 42632 (20.3%) |
| 40s | 5219 (20.0%) | 33303 (18.1%) | 38522 (18.3%) |
| 50s | 4913 (18.8%) | 33679 (18.3%) | 38592 (18.3%) |
| 60s | 2531 (9.7%) | 20418 (11.1%) | 22949 (10.9%) |
| 70s | 509 (1.9%) | 4288 (2.3%) | 4797 (2.3%) |
| 80s | 96 (0.4%) | 716 (0.4%) | 812 (0.4%) |
| Hourly Wage |  |  |  |
| Mean (SD) | 17.3 (9.33) | 19.6 (11.0) | 19.3 (10.8) |
| Median [Min, Max] | 15.0 [1.00, 100] | 16.8 [0.0100, 100] | 16.5 [0.0100, 100] |
| Sex |  |  |  |
| Female | 14696 (56.3%) | 92721 (50.3%) | 107417 (51.0%) |
| Male | 11417 (43.7%) | 91617 (49.7%) | 103034 (49.0%) |
| Education Group |  |  |  |
| Associate Degree or more | 7565 (29.0%) | 63284 (34.3%) | 70849 (33.7%) |
| High school diploma or equivalent | 9985 (38.2%) | 62735 (34.0%) | 72720 (34.6%) |
| Less than High School diploma | 2434 (9.3%) | 20988 (11.4%) | 23422 (11.1%) |
| Some college but no degree | 6129 (23.5%) | 37331 (20.3%) | 43460 (20.7%) |
# Identifying distinct entries for the variables
distinct_entries <- df4 %>%
  summarise(
    age_group = n_distinct(age_group),
    education = n_distinct(education_group),
    sex = n_distinct(sex),
    race = n_distinct(race),
    County = n_distinct(County)
  )

distinct_entries
##   age_group education sex race County
## 1         8         4   2    2     52

Observations:

  • The age variable is now represented by only 8 ranges, instead of being continuous. This reduces the number of distinct values from 67 to 8.

  • The education variable now has only 4 levels, instead of 16.

  • The Zip_Code variable had 226 distinct values. After generalizing zip codes to counties, this number was reduced to 52.

In k-anonymity, each unique combination of quasi-identifiers must appear at least k times in the dataset to ensure that individuals cannot be easily re-identified. The data frame below shows each distinct set of quasi-identifiers, along with a count indicating how many records share that exact combination. This dataframe tells us that there are 5,773 possible combinations of quasi-identifiers in our dataset.

# Count the number of occurrences of each combination of quasi-identifiers
combinations <- df4 %>%
  group_by(County, sex, race, age_group, education_group) %>%
  summarise(count = n(), .groups = "drop")

# Display combinations and their counts
head(combinations, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| County | sex | race | age_group | education_group | count |
|---|---|---|---|---|---|
| Abingdon | Female | Black | 10s | High school diploma or equivalent | 2 |
| Abingdon | Female | Black | 10s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 10s | Some college but no degree | 2 |
| Abingdon | Female | Black | 20s | Associate Degree or more | 6 |
| Abingdon | Female | Black | 20s | High school diploma or equivalent | 15 |
| Abingdon | Female | Black | 20s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 20s | Some college but no degree | 8 |
| Abingdon | Female | Black | 30s | Associate Degree or more | 8 |
| Abingdon | Female | Black | 30s | High school diploma or equivalent | 11 |
| Abingdon | Female | Black | 30s | Less than High School diploma | 2 |

Out of those possible combinations, we wish to find how many do not meet k-anonymity. To do this, we filter the dataset to only the rows whose count value is less than \(k = 2\).

k = 2

# Find combinations that do not meet k-anonymity (where count < k)
unique_combinations <- combinations %>%
  filter(count < k)

head(unique_combinations, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| County | sex | race | age_group | education_group | count |
|---|---|---|---|---|---|
| Abingdon | Female | Black | 10s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 20s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 70s | Associate Degree or more | 1 |
| Abingdon | Female | Black | 70s | High school diploma or equivalent | 1 |
| Abingdon | Female | Black | 80s | High school diploma or equivalent | 1 |
| Abingdon | Female | Black | 80s | Less than High School diploma | 1 |
| Abingdon | Female | White | 70s | Less than High School diploma | 1 |
| Abingdon | Female | White | 80s | Associate Degree or more | 1 |
| Abingdon | Male | Black | 30s | Less than High School diploma | 1 |
| Abingdon | Male | Black | 50s | Some college but no degree | 1 |

After filtering the data to rows with count values less than 2, we find that 573 combinations, about 10% of the 5,773 possible combinations, do not satisfy k-anonymity. Thus, we applied additional generalization techniques in the dataset below. Instead of using 10-year age intervals, we broadened the age groups to 20-year ranges. Additionally, we reduced the geographical detail by grouping individuals into 6 broader regions (Northern Virginia, Central Virginia, Eastern Virginia, Western Virginia, Southwest Virginia, and Washington), rather than the original 52 distinct counties.

# Load dplyr
library(dplyr)

# Apply regional generalization
df5 <- df4 %>%
  mutate(Region = case_when(
    County %in% c("Alexandria", "Arlington", "Centreville", "Chantilly", "Fairfax", "Falls Church", 
                  "Herndon", "Leesburg", "Mc Lean", "Manassas", "Reston", "Sterling", 
                  "Vienna", "Warrenton", "Woodbridge") ~ "Northern Virginia",
    County %in% c("Charlottesville", "Chester", "Chesterfield", "Fredericksburg", "Glen Allen", 
                  "Henrico", "Mechanicsville", "Midlothian", "Richmond", "Spotsylvania", 
                  "Stafford") ~ "Central Virginia",
    County %in% c("Chesapeake", "Hampton", "Newport News", "Norfolk", "Portsmouth", 
                  "Suffolk", "Virginia Beach", "Williamsburg", "Yorktown") ~ "Eastern Virginia",
    County %in% c("Blacksburg", "Harrisonburg", "Lynchburg", "Winchester") ~ "Western Virginia",
    County %in% c("Abingdon", "Bristol", "Danville", "Farmville", "Radford") ~ "Southwest Virginia",
    County == "Washington" ~ "Washington",
    TRUE ~ "Unknown"
  ))

df5$County <- NULL

# Define the desired column order, with Region first and hourwage last
new_order <- c("Region", "sex", "race", "age_group", "education_group", "hourwage")


# Generalize age to 20-year intervals
df5 <- df5 %>%
  mutate(age_group = case_when(
    age_group %in% c("10s", "20s") ~ "10-29",
    age_group %in% c("30s", "40s") ~ "30-49",
    age_group %in% c("50s", "60s") ~ "50-69",
    age_group %in% c("70s", "80s") ~ "70-89",
    TRUE ~ age_group  # Catch any other categories, if needed
  ))

# Reorder the dataframe columns according to the new order
df5 <- df5[, new_order]
# Count the number of occurrences of each combination of quasi-identifiers
combinations <- df5 %>%
  group_by(Region, sex, race, age_group, education_group) %>%
  summarise(count = n(), .groups = "drop")

# Sort the combinations dataset by the 'count' column in ascending order
combinations_sorted <- combinations %>%
  arrange(count)

# Display the sorted dataset
head(combinations_sorted, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | count |
|---|---|---|---|---|---|
| Southwest Virginia | Male | Black | 70-89 | Associate Degree or more | 1 |
| Southwest Virginia | Male | Black | 70-89 | Less than High School diploma | 1 |
| Washington | Female | Black | 70-89 | High school diploma or equivalent | 1 |
| Washington | Female | Black | 70-89 | Less than High School diploma | 1 |
| Washington | Male | Black | 70-89 | High school diploma or equivalent | 1 |
| Western Virginia | Female | Black | 70-89 | High school diploma or equivalent | 1 |
| Southwest Virginia | Female | Black | 70-89 | Less than High School diploma | 2 |
| Southwest Virginia | Female | Black | 70-89 | Some college but no degree | 2 |
| Unknown | Male | Black | 70-89 | Less than High School diploma | 2 |
| Washington | Male | Black | 10-29 | Less than High School diploma | 2 |

Challenges & Limitations

Despite efforts to achieve k-anonymity through suppression and generalization in our example, the dataset still contains six unique combinations of quasi-identifiers that fail to meet the k-anonymity requirement. Each combination of Region, sex, race, age_group, and education_group should ideally occur at least \(k=2\) times to prevent re-identification of individuals. However, the first six rows above have a count of 1 (that is, \(k-1\)), indicating that these combinations are still too specific and leave individuals identifiable. This introduces the following challenges:

  • High Granularity: Our dataset exhibits high granularity, meaning that the data is broken down into highly detailed and specific combinations of quasi-identifiers. This level of detail makes it extremely challenging to group rows into equivalence classes that satisfy k-anonymity. Even after generalizing the data further by broadening age groups and simplifying location details, the variety of attributes such as region or education level still resulted in six unique combinations. These unique rows, six records out of 210,451, highlight a critical limitation: achieving perfect k-anonymity for every record can be exceptionally difficult in datasets with significant variability and diverse populations. Because of this, one could argue that the dataset achieves a practical level of k-anonymity, as the non-compliant portion is vanishingly small.

  • Imbalance and Minority Representation: Another significant challenge comes from the representation of minority groups in the dataset. For example, a substantial portion of the quasi-identifier combinations that fail to meet k-anonymity are associated with individuals of the Black race. This is because, as a minority group, there are fewer instances of Black individuals in the dataset, making it less likely for them to share the same set of quasi-identifiers as others. Consequently, these individuals are more prone to being identifiable even after generalization and suppression.

  • The Trade-Off Dilemma: Let’s consider that we are still not satisfied and want our entire dataset to fully satisfy k-anonymity. Achieving this would require even more extensive suppression and generalization. For instance, we might need to group individuals into broader age ranges spanning 30 years and simplify educational levels to just two categories: “More than a Bachelor’s Degree” and “Bachelor’s Degree or Less.” While this would reduce the number of unique equivalence classes and make it easier to satisfy k-anonymity, it comes at a cost: if too many attributes are suppressed or over-generalized, the data can become meaningless. Achieving k-anonymity therefore requires balancing data utility with privacy, as the sketch below illustrates.
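
A minimal sketch of that more aggressive generalization (the band labels and cut points are our own illustration; the exact Bachelor’s-based split described above would require the original education variable, so we collapse the existing groups instead):

# Broaden df4's 10-year age groups into roughly 30-year bands
df6 <- df4
df6$age_group <- case_when(
  df6$age_group %in% c("10s", "20s", "30s") ~ "10-39",
  df6$age_group %in% c("40s", "50s", "60s") ~ "40-69",
  TRUE                                      ~ "70-89"
)

# Collapse the four education groups into just two levels
df6$education_group <- ifelse(
  df6$education_group == "Associate Degree or more",
  "Associate Degree or more",
  "Less than an Associate Degree"
)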

l-diversity

k-Anonymity is a widely used method to protect against identity disclosure by ensuring that each record is indistinguishable from at least \(k-1\) others based on quasi-identifiers. However, it falls short in preventing attribute disclosure. Attribute disclosure occurs when an attacker can infer sensitive information about an individual even if the individual’s exact identity is protected. This situation arises particularly when all records in a k-anonymous group share the same sensitive attribute value, making it easy for attackers to draw confident inferences.

To address this shortcoming, l-diversity extends k-anonymity by adding another layer of protection. It requires that sensitive attributes within each equivalence class have at least \(l\) well-represented distinct values. This means that even if an attacker knows the quasi-identifiers of an individual, they cannot easily determine the sensitive attribute. More formally, the book defines it as follows:

Definition l-diversity: ‘A property of a dataset where for each equivalence class in the dataset, there are at least l well-represented values for the sensitive attribute.’

Example: Consider this simplified dataset that has achieved \(k=3\) anonymity:

# Create a dummy dataframe with 9 rows and 3 equivalence classes
df_dummy <- data.frame(
  Region = c(
    "Northern Virginia", "Northern Virginia", "Northern Virginia",
    "Central Virginia", "Central Virginia", "Central Virginia",
    "Eastern Virginia", "Eastern Virginia", "Eastern Virginia"
  ),
  sex = c(
    "Female", "Female", "Female",
    "Female", "Female", "Female",
    "Male", "Male", "Male"
  ),
  race = c(
    "White", "White", "White",
    "Asian", "Asian", "Asian",
    "Black", "Black", "Black"
  ),
  age_group = c(
    "30-49", "30-49", "30-49",
    "40-59", "40-59", "40-59",
    "10-29", "10-29", "10-29"
  ),
  education_group = c(
    "Bachelor's Degree", "Bachelor's Degree", "Bachelor's Degree",
    "High School", "High School", "High School",
    "Master's Degree", "Master's Degree", "Master's Degree"
  ),
  hourwage = c(
    25.00, 25.00, 25.00,
    16.75, 14.00, 17.20,
    30.10, 27.10, 30.10
  )
)

head(df_dummy, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | hourwage |
|---|---|---|---|---|---|
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 |
| Central Virginia | Female | Asian | 40-59 | High School | 16.75 |
| Central Virginia | Female | Asian | 40-59 | High School | 14.00 |
| Central Virginia | Female | Asian | 40-59 | High School | 17.20 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 30.10 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 27.10 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 30.10 |

Analysis of Diversity

In this dataset, there are three equivalence classes:

  • First Class: (Northern Virginia, White, Female, age 30-49, Bachelor’s Degree)

    • This class is not diverse; it has only one unique hourwage value of 25.00. Thus, it is l = 1 diverse. If an attacker knows that Maria is in the database and is a 37-year-old White female with a Bachelor’s degree from Northern Virginia, they can confidently infer that she earns 25 dollars an hour.
  • Second Class: (Central Virginia, Asian, Female, age 40-59, High School)

    • This class is l = 3 diverse, with three different hourwage values: 16.75, 14.00, and 17.20.
  • Third Class: (Eastern Virginia, Black, Male, age 10-29, Master’s Degree)

    • This class is l = 2 diverse, with two different hourwage values: 30.10 and 27.10.
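
We can confirm these values programmatically by counting the distinct sensitive values in each equivalence class, which is exactly the quantity \(l\) from the definition above:

# Distinct hourwage values per equivalence class (the l of each class);
# the dataset as a whole is only as diverse as its least diverse class
df_dummy %>%
  group_by(Region, sex, race, age_group, education_group) %>%
  summarise(l = n_distinct(hourwage), .groups = "drop")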

Challenges & Limitations

To achieve \(l = 2\) diversity across the entire dataset, simply generalizing or suppressing information is often insufficient. The first equivalence class lacks enough variation in the sensitive attribute (hourwage) to meet the l-diversity requirement. Addressing this would require either eliminating the class or altering the data significantly, which could compromise the utility of the dataset. This example demonstrates that while k-anonymity provides a baseline level of protection, it does not fully safeguard against attribute disclosure. Just as with k-anonymity, achieving l-diversity requires more complex strategies and often involves trade-offs between maintaining data utility and ensuring privacy.

Example: Suppose we augment our dataset to include a new sensitive attribute: Disease. This updated dataset maintains \(k = 3\) anonymity, as for every combination of quasi-identifiers, there are at least 3 records that share the same set of values. Additionally, the dataset achieves \(l = 2\) diversity because each equivalence class contains at least two distinct values for both sensitive attributes.

# Create a dummy dataframe with 9 rows and 3 equivalence classes, including the 'Disease' column
df_dummy <- data.frame(
  Region = c(
    "Northern Virginia", "Northern Virginia", "Northern Virginia",
    "Central Virginia", "Central Virginia", "Central Virginia",
    "Eastern Virginia", "Eastern Virginia", "Eastern Virginia"
  ),
  sex = c(
    "Female", "Female", "Female",
    "Female", "Female", "Female",
    "Male", "Male", "Male"
  ),
  race = c(
    "White", "White", "White",
    "Asian", "Asian", "Asian",
    "Black", "Black", "Black"
  ),
  age_group = c(
    "30-49", "30-49", "30-49",
    "40-59", "40-59", "40-59",
    "10-29", "10-29", "10-29"
  ),
  education_group = c(
    "Bachelor's Degree", "Bachelor's Degree", "Bachelor's Degree",
    "High School", "High School", "High School",
    "Master's Degree", "Master's Degree", "Master's Degree"
  ),
  hourwage = c(
    25.00, 26.00, 25.00,
    16.75, 14.00, 17.20,
    30.10, 27.10, 40.00
  ),
  Disease = c(
    "Diabetes", "Heart Disease", "Heart Disease",
    "Gastritis", "Stomach Cancer", "Gastric Ulcer",
    "Alopecia", "Diabetes", "Alopecia"
  )
)
head(df_dummy, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | hourwage | Disease |
|---|---|---|---|---|---|---|
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 | Diabetes |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 26.00 | Heart Disease |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 | Heart Disease |
| Central Virginia | Female | Asian | 40-59 | High School | 16.75 | Gastritis |
| Central Virginia | Female | Asian | 40-59 | High School | 14.00 | Stomach Cancer |
| Central Virginia | Female | Asian | 40-59 | High School | 17.20 | Gastric Ulcer |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 30.10 | Alopecia |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 27.10 | Diabetes |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 40.00 | Alopecia |
  • Attribute Disclosure:

One of the major challenges of ensuring data privacy, even after implementing l-diversity, is the risk of attribute disclosure. Consider a scenario where an attacker has some auxiliary information about an individual in the dataset. For example, suppose the attacker knows that Emily is an Asian female with a high school education. Even though the database achieves \(k=3\) anonymity and \(l=2\) diversity, the attacker can still make concerning inferences. Despite not knowing Emily’s exact hourly wage, the attacker can confidently deduce that Emily earns less than 18 dollars per hour, because within Emily’s equivalence class (Central Virginia, Asian, Female, age 40-59, High School) all wage values fall below this threshold. Furthermore, the attacker can infer that Emily has a stomach-related health issue, as the only diseases represented within this class are “Gastritis,” “Stomach Cancer,” and “Gastric Ulcer.” Because the sensitive values are distinct yet semantically similar, this is known as a similarity attack.

t-closeness

While k-anonymity and l-diversity aim to protect against identity and attribute disclosure, respectively, they may still leave datasets vulnerable to more sophisticated inference attacks. Even with l-diversity, if the distribution of sensitive attributes within an equivalence class differs greatly from the overall distribution, attackers with background knowledge can still infer sensitive information. This vulnerability is addressed through t-closeness.

Definition: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole dataset does not exceed a specified threshold \(t\).

– A table/dataset is said to have t-closeness if all equivalence classes have t-closeness.

Earth Mover’s Distance

The Earth Mover’s Distance (EMD) measures the minimum “cost” to transform one distribution into another. This cost is calculated by summing over the absolute differences of the cumulative probabilities from each distribution up to each point \(i\), normalized by the total number of elements minus one (\(m - 1\)). By calculating the EMD between the distribution of the sensitive attribute in each equivalence class and the distribution in the full dataset, we can quantitatively measure how much the anonymized data deviates from the original distribution.

For an ordered attribute, the formula for EMD is expressed as:

\[ E(P, Q) = \frac{1}{m - 1} \sum_{i=1}^m \left| \sum_{j=1}^i (p_j - q_j) \right| \]

where:

  • \(m\) is the number of elements in the distributions \(P\) and \(Q\).
  • \(p_j\) and \(q_j\) are the probabilities at position \(j\) in distributions \(P\) and \(Q\), respectively.

Setting a Threshold

Choosing an appropriate threshold \(t\) for t-closeness is crucial to ensuring the effectiveness of this privacy-preserving method. The threshold \(t\) determines the maximum allowable distance between the distribution of a sensitive attribute within each equivalence class and its distribution across the entire dataset. A smaller \(t\) value indicates a stricter privacy constraint.

Unfortunately, “[a]t this point, no universal threshold for t appears to exist.” The authors mention that “the t-closeness principle can be applied using other distance measures. While EMD is the best measure we have found so far, it is certainly not perfect. In particular, the relationship between the value t and information gain is unclear.” Because of this, it is crucial to consider the following aspects:

    1. Data Sensitivity: Highly sensitive data might necessitate a lower threshold to minimize the risk of revealing personal information through statistical inference.
    2. Data Distribution: Analyze the overall distribution of the sensitive attribute; a skewed distribution may require a stricter threshold.
    3. Regulatory Compliance: Ensure compliance with relevant data protection laws, which may influence the minimum acceptable threshold.
    4. Privacy vs. Data Utility: Balance the need for data privacy with the utility of the anonymized data. A lower threshold enhances privacy but may reduce utility.
    5. Iterative Refinement: Continuously monitor and adjust the threshold based on new data, legal changes, or advances in privacy research.

Step-by-Step Implementation

1. Calculate Overall Distribution of Sensitive Attributes:

Now that we understand the purpose of t-closeness, the first step in implementing it on our dataset is to determine the distribution of each sensitive attribute, in this case hourwage and Disease, across the entire dataset. For this example, we will focus on the numeric hourwage attribute; a similar process should be applied to the categorical Disease attribute.

overall_distribution_hourwage <- df_dummy %>%
  summarise(mean_hourwage = mean(hourwage, na.rm = TRUE),
            sd_hourwage = sd(hourwage, na.rm = TRUE))

head(overall_distribution_hourwage, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| mean_hourwage | sd_hourwage |
|---|---|
| 24.57222 | 7.942572 |
overall_distribution_disease <- df_dummy %>%
  count(Disease) %>%
  mutate(proportion = n / sum(n))

head(overall_distribution_disease, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Disease | n | proportion |
|---|---|---|
| Alopecia | 2 | 0.2222222 |
| Diabetes | 2 | 0.2222222 |
| Gastric Ulcer | 1 | 0.1111111 |
| Gastritis | 1 | 0.1111111 |
| Heart Disease | 2 | 0.2222222 |
| Stomach Cancer | 1 | 0.1111111 |

2. Calculate Distribution Within Each Equivalence Class:

For each group of records that share the same quasi-identifiers (e.g., Region, sex, race, age_group, education_group), compute the distribution of the sensitive attributes.

equivalence_classes <- df_dummy %>%
  group_by(Region, sex, race, age_group, education_group) %>%
  summarise(mean_hourwage = mean(hourwage, na.rm = TRUE),
            sd_hourwage = sd(hourwage, na.rm = TRUE),
            disease_distribution = list(table(Disease) / length(Disease)),
            .groups = "drop")

head(equivalence_classes, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | mean_hourwage | sd_hourwage | disease_distribution |
|---|---|---|---|---|---|---|---|
| Central Virginia | Female | Asian | 40-59 | High School | 15.98333 | 1.7322914 | 0.3333333, 0.3333333, 0.3333333 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 32.40000 | 6.7505555 | 0.6666667, 0.3333333 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.33333 | 0.5773503 | 0.3333333, 0.6666667 |

3. Compare Distributions Using a Distance Metric:

Use a distance metric (like the Earth Mover’s Distance or Jensen-Shannon Divergence) to quantify how different the distributions within each equivalence class are from the overall distribution. For simplicity, let’s calculate the absolute difference in means:

equivalence_classes <- equivalence_classes %>%
  mutate(hourwage_diff = abs(mean_hourwage - overall_distribution_hourwage$mean_hourwage))

head(equivalence_classes, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | mean_hourwage | sd_hourwage | disease_distribution | hourwage_diff |
|---|---|---|---|---|---|---|---|---|
| Central Virginia | Female | Asian | 40-59 | High School | 15.98333 | 1.7322914 | 0.3333333, 0.3333333, 0.3333333 | 8.5888889 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 32.40000 | 6.7505555 | 0.6666667, 0.3333333 | 7.8277778 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.33333 | 0.5773503 | 0.3333333, 0.6666667 | 0.7611111 |

4. Adjust the Data to Meet t-Closeness:

If the difference exceeds a predefined threshold \(t\), adjust the data. This might involve:

  • Smoothing the values of the sensitive attribute (a sketch follows this list).

    • For the Central Virginia equivalence class, we may increase the hourwage values slightly to bring the mean closer to 24.57. Given the low standard deviation, adjustments should be subtle to maintain the class’s internal consistency.

    • For the Eastern Virginia class, we may decrease the hourwage values to make the mean closer to 24.57, while considering the existing variability to avoid over-smoothing.

    • For the Northern Virginia class, minimal or no adjustment is needed since the hourwage mean is already close to the overall mean.

  • Grouping records further or making other data transformations to balance the distributions.
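
A minimal sketch of the smoothing idea, pulling each class’s wages a fraction of the way toward the overall mean (the shrinkage factor alpha is an arbitrary illustration, not a prescribed value):

# Shift every wage in a class so the class mean moves alpha of the way
# toward the overall mean; alpha = 1 would equalize all class means
alpha <- 0.5
df_smoothed <- df_dummy %>%
  group_by(Region, sex, race, age_group, education_group) %>%
  mutate(hourwage = hourwage +
           alpha * (overall_distribution_hourwage$mean_hourwage - mean(hourwage))) %>%
  ungroup()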

overall_mean <- overall_distribution_hourwage$mean_hourwage
t <- 0.1  # 10% threshold

equivalence_classes <- equivalence_classes %>%
  mutate(t_closeness = ifelse(hourwage_diff <= overall_mean * t, TRUE, FALSE))
# Install the transport package
#install.packages("transport")

# Load the package
library(transport)
## Warning: package 'transport' was built under R version 4.3.3
# Overall wage distribution
overall_wage <- df_dummy$hourwage

# Initialize an empty list to store results
results <- list()

# Create a unique identifier for each group based on quasi-identifiers
df_dummy$GroupID <- with(df_dummy, interaction(Region, sex, race, age_group, education_group))

# Unique groups
unique_groups <- unique(df_dummy$GroupID)

# Loop over each group to calculate EMD
for (group_id in unique_groups) {
  subgroup <- df_dummy[df_dummy$GroupID == group_id, ]
  subgroup_wage <- subgroup$hourwage
  
  # Calculate Wasserstein distance
  emd_value <- wasserstein1d(subgroup_wage, overall_wage)
  
  # Store results
  results[[as.character(group_id)]] <- emd_value
}

# Display results
results
## $`Northern Virginia.Female.White.30-49.Bachelor's Degree`
## [1] 5.25
## 
## $`Central Virginia.Female.Asian.40-59.High School`
## [1] 8.588889
## 
## $`Eastern Virginia.Male.Black.10-29.Master's Degree`
## [1] 7.827778
# Example Data
df_data <- data.frame(
  ZIP_Code = c("4767*", "4767*", "4767*", "4790*", "4790*", "4790*", "4760*", "4760*", "4760*"),
  Age = c("<= 40", "<= 40", "<= 40", ">= 40", ">= 40", ">= 40", "<= 40", "<= 40", "<= 40"),
  Salary = c("3K", "5K", "9K", "6K", "11K", "8K", "4K", "7K", "10K"),
  Disease = c("gastric ulcer", "stomach cancer", "pneumonia", "gastritis", "flu", "bronchitis", "gastritis", "bronchitis", "stomach cancer")
)

# Convert Salary to numeric
df_data$Salary <- as.numeric(sub("K", "", df_data$Salary)) * 1000
table(df_data$Salary) / nrow(df_data)
## 
##      3000      4000      5000      6000      7000      8000      9000     10000 
## 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 
##     11000 
## 0.1111111
create_auto_distributions <- function(data, attribute) {
  # Determine unique values in the specified attribute to form equivalence classes
  unique_values <- unique(data[[attribute]])
  
  # Initialize list to hold distributions
  distributions <- list()
  
  # Total number of elements
  n <- nrow(data)
  
  # Loop through each unique value to create distributions
  for (value in unique_values) {
    # Create a zero vector of length n
    distribution <- numeric(n)
    
    # Find rows where the attribute matches the current unique value
    indices <- which(data[[attribute]] == value)
    
    # Assign equal probabilities to these indices
    distribution[indices] <- 1 / length(indices)
    
    # Name the distribution based on the unique value
    distributions[[paste(attribute, value, sep = "_")]] <- distribution
  }
  
  # Return the list of distributions
  return(distributions)
}

create_auto_distributions(df_data, "Salary")
## $Salary_3000
## [1] 1 0 0 0 0 0 0 0 0
## 
## $Salary_5000
## [1] 0 1 0 0 0 0 0 0 0
## 
## $Salary_9000
## [1] 0 0 1 0 0 0 0 0 0
## 
## $Salary_6000
## [1] 0 0 0 1 0 0 0 0 0
## 
## $Salary_11000
## [1] 0 0 0 0 1 0 0 0 0
## 
## $Salary_8000
## [1] 0 0 0 0 0 1 0 0 0
## 
## $Salary_4000
## [1] 0 0 0 0 0 0 1 0 0
## 
## $Salary_7000
## [1] 0 0 0 0 0 0 0 1 0
## 
## $Salary_10000
## [1] 0 0 0 0 0 0 0 0 1
# Define the overall (uniform) distribution and the three equivalence-class
# distributions over the nine records
Q3  <- rep(1/9, 9)
P31 <- c(1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0)
P32 <- c(0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3)
P33 <- c(0, 0, 0, 0, 1/3, 0, 1/3, 1/3, 0)

# Calculate EMD using the defined formula
emd_function <- function(P, Q) {
  m <- length(P)  # number of positions in the distributions
  abs_cumulative_diffs <- abs(cumsum(P - Q))  # absolute cumulative differences
  emd <- sum(abs_cumulative_diffs) / (m - 1)  # EMD calculation
  return(emd)
}

# Calculate EMD for P31 vs. Q3
emd_P31 <- emd_function(P31, Q3)
emd_P32 <- emd_function(P32, Q3)
emd_P33 <- emd_function(P33, Q3)

print(sprintf("EMD P3,1 vs. Q3: %f", emd_P31))
## [1] "EMD P3,1 vs. Q3: 0.375000"
print(sprintf("EMD P3,2 vs. Q3: %f", emd_P32))
## [1] "EMD P3,2 vs. Q3: 0.166667"
print(sprintf("EMD P3,3 vs. Q3: %f", emd_P33))
## [1] "EMD P3,3 vs. Q3: 0.236111"
# A step-by-step view of the same computation for P3,1 vs. Q3
P31 <- c(1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0)
Q3 <- rep(1/9, 9)

# Cumulative differences: entry i is the net mass that must cross the gap
# between positions i and i + 1
cum_diff <- cumsum(P31 - Q3)

# Moving one unit of mass across one gap costs one unit of work, so the total
# work is the sum of the absolute cumulative differences (the last entry is
# zero because both distributions sum to 1)
total_work <- sum(abs(cum_diff[-length(cum_diff)]))

# Normalize by m - 1, as in the EMD formula above
emd_value <- total_work / (length(P31) - 1)

cat("EMD for P3,1 vs. Q3:", emd_value, "\n")
## EMD for P3,1 vs. Q3: 0.375

Fair Relabelling

Let’s now turn to the discrimination aspect of fairness. How can we measure that a historical dataset contains a bias against a sensitive group, and how can we change the dataset in a preprocessing step so as to remove this measured bias from the dataset?

Measuring Fairness of a Dataset

  • Positive outcome: An outcome we want to balance across groups, such as earning at or above the mean hourly wage.

  • Sensitive group (S): Group that might face discrimination (e.g., Black).

  • Non-sensitive group (NS): Group that may have an advantage (e.g., White).

Step-by-step Implementation

For this example, let’s assume df2 contains the data with columns race and hourwage.

# Grab the race and hourwage columns from df2
df_race_outcome <- df2 %>%
  select(race, hourwage)

Step 1: Create the outcome variable Y based on the mean hourwage

# Calculate the mean hourwage to use as the cutoff for a positive outcome
mean_hourwage <- mean(df_race_outcome$hourwage, na.rm = TRUE)

# Create the outcome variable Y: 1 if hourwage >= mean_hourwage, 0 otherwise
df_race_outcome <- df_race_outcome %>%
  mutate(Y = ifelse(hourwage >= mean_hourwage, 1, 0))

# View the resulting dataframe
head(df_race_outcome, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| race | hourwage | Y |
|---|---|---|
| White | 11.49 | 0 |
| White | 14.50 | 0 |
| White | 9.00 | 0 |
| White | 7.50 | 0 |
| White | 7.25 | 0 |
| White | 10.42 | 0 |
| White | 13.00 | 0 |
| White | 12.00 | 0 |
| White | 28.75 | 1 |
| White | 13.00 | 0 |

Step 2: Calculate the number of positive outcomes for non-sensitive (White) and sensitive (Black) groups

ns_positive <- df_race_outcome %>% filter(race == "White" & Y == 1) %>% nrow()  # Count of positive outcomes for 'White'
s_positive <- df_race_outcome %>% filter(race == "Black" & Y == 1) %>% nrow()   # Count of positive outcomes for 'Black'

# Print the results
cat("Number of positive outcomes for White:", ns_positive, "\n")
## Number of positive outcomes for White: 69361
cat("Number of positive outcomes for Black:", s_positive, "\n")
## Number of positive outcomes for Black: 7142

Step 3: Calculate the total number of instances for non-sensitive (White) and sensitive (Black) groups

ns_total <- df_race_outcome %>% filter(race == "White") %>% nrow()  # Total instances for 'White'
s_total <- df_race_outcome %>% filter(race == "Black") %>% nrow()   # Total instances for 'Black'

# Print the values
cat("Total instances for White group (ns_total):", ns_total, "\n")
## Total instances for White group (ns_total): 184338
cat("Total instances for Black group (s_total):", s_total, "\n")
## Total instances for Black group (s_total): 26113

Step 4: Calculate the probabilities

P_ns_positive <- ns_positive / ns_total  # Probability of positive outcome for 'ns'
P_s_positive <- s_positive / s_total      # Probability of positive outcome for 's'
# Print the probabilities
cat("Probability of positive outcome for White (P_ns_positive):", P_ns_positive, "\n")
## Probability of positive outcome for White (P_ns_positive): 0.3762708
cat("Probability of positive outcome for Black (P_s_positive):", P_s_positive, "\n")
## Probability of positive outcome for Black (P_s_positive): 0.2735036

Step 5: Calculate Statistical Parity and Disparate Impact

statistical_parity <- P_ns_positive - P_s_positive  # Absolute difference in probabilities
disparate_impact <- P_ns_positive / P_s_positive    # Ratio of probabilities

cat("Statistical Parity:", statistical_parity, "\n")
## Statistical Parity: 0.1027671
cat("Disparate Impact:", disparate_impact, "\n")
## Disparate Impact: 1.375743
  • Statistical Parity measures the absolute difference in the probability of a positive outcome between the non-sensitive (White) and sensitive (Black) groups. Here, a value of 0.1028 means that White individuals have about a 10.3-percentage-point higher probability of receiving a positive outcome than Black individuals.

  • Disparate Impact measures the ratio of the probability of a positive outcome between the two groups. A value of 1.376 indicates that White individuals are about 1.38 times more likely to receive a positive outcome than Black individuals.

These results suggest that the dataset may contain a bias favoring White individuals over Black individuals for positive outcomes. However, disparate impact on its own does not account for other factors that might explain the gap, so further examination of the data is needed to assess fairness more accurately.

Massaging

Massaging involves relabeling a specific number of instances to balance outcomes across groups.

To calculate the number of label changes \(M\) required to balance your dataset, you can use the formula:

\[ M = \frac{\text{disc}(D) \cdot s_T \cdot ns_T}{s_T + ns_T} \]

where:

  • disc(D): is the measured discrimination in the dataset (Statistical Parity).
  • s_T: is the total count of individuals in the sensitive group.
  • ns_T: is the total count of individuals in the non-sensitive group.

Let’s substitute our calculated values into this formula:

  • disc(D) (Statistical Parity): 0.1028
  • s_T (Total for Black group): 26,113
  • ns_T (Total for White group): 184,338

\[ M = \frac{0.1028 \times 26113 \times 184338}{26113 + 184338} \]

The result for \(M\) gives the number of label changes needed to remove the measured bias: \(M\) labels are flipped from negative to positive in the sensitive group, and \(M\) labels from positive to negative in the non-sensitive group.

M <- (statistical_parity * s_total * ns_total) / (s_total + ns_total)

# Print the result for M
cat("Number of label changes required (M):", M, "\n")
## Number of label changes required (M): 2350.579
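
Finally, a minimal sketch of carrying out the relabeling itself. The standard massaging algorithm uses a learned ranker to choose which records to flip; as a simplification we flip the records closest to the wage threshold, promoting roughly \(M\) sensitive-group individuals and demoting the same number of non-sensitive-group individuals:

M_round <- round(M)

# Distance from the threshold used to define Y; flipping the closest records
# changes the labels as little as possible
df_massaged <- df_race_outcome %>%
  mutate(row_id = row_number(),
         dist = abs(hourwage - mean_hourwage))

# Promote the M Black individuals with Y = 0 nearest the threshold
promote_ids <- df_massaged %>%
  filter(race == "Black", Y == 0) %>%
  arrange(dist) %>%
  slice_head(n = M_round) %>%
  pull(row_id)

# Demote the M White individuals with Y = 1 nearest the threshold
demote_ids <- df_massaged %>%
  filter(race == "White", Y == 1) %>%
  arrange(dist) %>%
  slice_head(n = M_round) %>%
  pull(row_id)

df_massaged <- df_massaged %>%
  mutate(Y = case_when(row_id %in% promote_ids ~ 1,
                       row_id %in% demote_ids ~ 0,
                       TRUE ~ Y))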