# Load the libraries used throughout this document
library(dplyr)       # data manipulation and the %>% pipe
library(ggplot2)     # plotting
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

df1 = read.csv("C:\\Users\\luzya\\OneDrive\\Escritorio\\PhD program\\Fall 2024\\DS 8998 - Farhana\\df_ex2.csv")
df2 = read.csv("C:\\Users\\luzya\\OneDrive\\Escritorio\\PhD program\\Fall 2024\\DS 8998 - Farhana\\df_ex1.csv")
head(df2, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| id | Zip_Code | sex | race | age | education | hourwage |
|---|---|---|---|---|---|---|
| ID0008 | 23221 | Female | White | 30 | Associate’s degree, occupational/vocational program | 11.49 |
| ID0009 | 23460 | Male | White | 33 | Associate’s degree, occupational/vocational program | 14.50 |
| ID0077 | 22153 | Male | White | 60 | High school diploma or equivalent | 9.00 |
| ID0080 | 22303 | Female | White | 41 | High school diploma or equivalent | 7.50 |
| ID0081 | 22181 | Female | White | 17 | Grade 11 | 7.25 |
| ID0083 | 24015 | Female | White | 31 | Associate’s degree, academic program | 10.42 |
| ID0084 | 23832 | Male | White | 31 | High school diploma or equivalent | 13.00 |
| ID0088 | 23603 | Male | White | 53 | Grade 11 | 12.00 |
| ID0089 | 23323 | Male | White | 45 | Some college but no degree | 28.75 |
| ID0090 | 22303 | Female | White | 38 | Professional school degree | 13.00 |
# Identifying distinct entries for the variables
distinct_entries <- df2 %>%
  summarise(
    age = n_distinct(age),
    education = n_distinct(education),
    sex = n_distinct(sex),
    race = n_distinct(race),
    Zip_Code = n_distinct(Zip_Code)
  )

distinct_entries
##   age education sex race Zip_Code
## 1  67        16   2    2      226

Exploratory Data Analysis

ggplot(df2,aes(x=hourwage)) +
  geom_histogram(color="Black",fill="skyblue")+
  labs(title = 'Histogram of Hourly Wage', x = 'Hourly Wage', y = 'Frequency')+
  theme(plot.title = element_text(hjust=0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

race_counts <- table(df2$race)

# Calculate percentages
race_percentages <- prop.table(race_counts) * 100

# Combine counts and percentages into a data frame for better readability
race_table <- data.frame(
  Count = race_counts,
  Percentage = sprintf("%.2f%%", race_percentages)  # Format as percentage
)

print("Table for Race Distribution:")
## [1] "Table for Race Distribution:"
print(race_table)
##   Count.Var1 Count.Freq Percentage
## 1      Black      26113     12.41%
## 2      White     184338     87.59%
library(table1)
table1::label(df2$age) <- "Age"
table1::label(df2$hourwage) <- "Hourly Wage"
table1::label(df2$education) <- "Education Level"
table1::label(df2$sex) <- "Sex"

table1::table1(~ age + hourwage + sex + education | race, data = df2)
|  | Black (N=26113) | White (N=184338) | Overall (N=210451) |
|---|---|---|---|
| Age |  |  |  |
| Mean (SD) | 40.8 (14.7) | 40.8 (15.4) | 40.8 (15.3) |
| Median [Min, Max] | 40.0 [15.0, 85.0] | 40.0 [15.0, 85.0] | 40.0 [15.0, 85.0] |
| Hourly Wage |  |  |  |
| Mean (SD) | 17.3 (9.33) | 19.6 (11.0) | 19.3 (10.8) |
| Median [Min, Max] | 15.0 [1.00, 100] | 16.8 [0.0100, 100] | 16.5 [0.0100, 100] |
| Sex |  |  |  |
| Female | 14696 (56.3%) | 92721 (50.3%) | 107417 (51.0%) |
| Male | 11417 (43.7%) | 91617 (49.7%) | 103034 (49.0%) |
| Education Level |  |  |  |
| 12th grade, no diploma | 515 (2.0%) | 2716 (1.5%) | 3231 (1.5%) |
| Associate's degree, academic program | 1784 (6.8%) | 13227 (7.2%) | 15011 (7.1%) |
| Associate's degree, occupational/vocational program | 1244 (4.8%) | 10943 (5.9%) | 12187 (5.8%) |
| Bachelor's degree | 3296 (12.6%) | 30087 (16.3%) | 33383 (15.9%) |
| Doctorate degree | 130 (0.5%) | 988 (0.5%) | 1118 (0.5%) |
| Grade 10 | 439 (1.7%) | 3959 (2.1%) | 4398 (2.1%) |
| Grade 11 | 870 (3.3%) | 5637 (3.1%) | 6507 (3.1%) |
| Grade 9 | 240 (0.9%) | 2986 (1.6%) | 3226 (1.5%) |
| Grades 1, 2, 3, or 4 | 67 (0.3%) | 955 (0.5%) | 1022 (0.5%) |
| Grades 5 or 6 | 93 (0.4%) | 2192 (1.2%) | 2285 (1.1%) |
| Grades 7 or 8 | 163 (0.6%) | 2172 (1.2%) | 2335 (1.1%) |
| High school diploma or equivalent | 9985 (38.2%) | 62735 (34.0%) | 72720 (34.6%) |
| Master's degree | 1015 (3.9%) | 7264 (3.9%) | 8279 (3.9%) |
| None or preschool | 47 (0.2%) | 371 (0.2%) | 418 (0.2%) |
| Professional school degree | 96 (0.4%) | 775 (0.4%) | 871 (0.4%) |
| Some college but no degree | 6129 (23.5%) | 37331 (20.3%) | 43460 (20.7%) |

Defining and Measuring Privacy

k-anonymity

K-anonymity ensures that each individual in a dataset cannot be distinguished from at least k-1 other individuals with respect to the quasi-identifiers in the dataset. This is done through grouping and suppression. Applying k-anonymity makes it more difficult for an attacker to re-identify specific individuals in the dataset.

Vocabulary

  • Identifiers (also known as key attributes): direct identifiers such as names, student numbers, email addresses, etc. These variables should in principle not be collected at all, or removed from the dataset if they are not necessary for your research project.

  • Quasi-identifiers: indirect identifiers that can lead to identification when combined with other quasi-identifiers in the dataset or external information. These are often demographic variables like age, sex, place of residence, etc., but could also be something entirely different like physical characteristics, timestamps, etc. In general, quasi-identifiers are usually variables that are likely to be known to someone in the outside world.

  • Sensitive attributes: variables of interest which should be protected, and which cannot be changed, because they are the main outcome variables. For example, it can be Medical condition in a healthcare dataset, or Income in a financial dataset.

It is important to correctly categorise the variables in your dataset as any of these variable types if you want to apply k-anonymity because they will determine how the dataset will be de-identified. To make a dataset k-anonymous, you must first identify which variables in the dataset are identifiers, quasi-identifiers, and sensitive attributes.

In the example above, we have that

  • Identifiers: id
  • Quasi-identifiers: zip_code, age, sex, race, and education
  • Sensitive attributes: hourwage

Next, you should set a value for k. If we choose a k of 2, every row in the dataset should have the same combination of quasi-identifiers as at least 1 other row in the dataset. Finally, you aggregate the dataset so that every combination of quasi-identifiers occurs at least k times.
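
As a quick check, k-anonymity can be verified programmatically by counting the size of every equivalence class and confirming that the smallest class contains at least \(k\) records. A minimal sketch (the helper is_k_anonymous below is our own, not from a package):

# Returns TRUE if every equivalence class defined by the quasi-identifiers
# contains at least k records
is_k_anonymous <- function(data, quasi_ids, k) {
  class_sizes <- data %>%
    count(across(all_of(quasi_ids)))  # one row (with size n) per equivalence class
  min(class_sizes$n) >= k
}

# Example: check the raw data against k = 2
is_k_anonymous(df2, c("Zip_Code", "age", "sex", "race", "education"), k = 2)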

Suppressing, Grouping, and Perturbing

Step-by-Step Implementation

Suppressing

# Applying suppression by removing unique identifier
df2$id <- NULL

head(df2, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Zip_Code | sex | race | age | education | hourwage |
|---|---|---|---|---|---|
| 23221 | Female | White | 30 | Associate’s degree, occupational/vocational program | 11.49 |
| 23460 | Male | White | 33 | Associate’s degree, occupational/vocational program | 14.50 |
| 22153 | Male | White | 60 | High school diploma or equivalent | 9.00 |
| 22303 | Female | White | 41 | High school diploma or equivalent | 7.50 |
| 22181 | Female | White | 17 | Grade 11 | 7.25 |
| 24015 | Female | White | 31 | Associate’s degree, academic program | 10.42 |
| 23832 | Male | White | 31 | High school diploma or equivalent | 13.00 |
| 23603 | Male | White | 53 | Grade 11 | 12.00 |
| 23323 | Male | White | 45 | Some college but no degree | 28.75 |
| 22303 | Female | White | 38 | Professional school degree | 13.00 |

Grouping

df3 = df1

df3$age_group <- cut(
  df1$age,
  breaks = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
  right  = FALSE,
  labels = c("10s", "20s", "30s", "40s", "50s", "60s", "70s", "80s")
)

df3$age <- NULL

df3$Zip_Code <- NULL

df3$id <- NULL

# Define the desired column order, with County first and hourwage last
new_order <- c("County", "sex", "race", "age_group", "education", "hourwage")

# Reorder the dataframe columns according to the new order
df3 <- df3[, new_order]

# Apply this method only after modelling (otherwise the task turns into classification)
#max_wage <- ceiling(max(df1$hourwage) / 5) * 5  # This rounds up to the nearest 5
#df1$wage_group <- cut(df1$hourwage, breaks=seq(0, max_wage, by=5), include.lowest=TRUE)
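# Note: the table below previews df4, but the step that creates it is not
# shown in this document. A minimal sketch of how df4 could be derived from
# df3, assuming the four education groups that appear in the output tables:
df4 <- df3
df4$education_group <- case_when(
  df4$education == "High school diploma or equivalent" ~ "High school diploma or equivalent",
  df4$education == "Some college but no degree" ~ "Some college but no degree",
  grepl("Associate|Bachelor|Master|Doctorate|Professional", df4$education) ~ "Associate Degree or more",
  TRUE ~ "Less than High School diploma"
)
df4$education <- NULL
df4 <- df4[, c("County", "sex", "race", "age_group", "education_group", "hourwage")]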
head(df4, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| County | sex | race | age_group | education_group | hourwage |
|---|---|---|---|---|---|
| Richmond | Female | White | 30s | Associate Degree or more | 11.49 |
| Virginia Beach | Male | White | 30s | Associate Degree or more | 14.50 |
| Springfield | Male | White | 60s | High school diploma or equivalent | 9.00 |
| Alexandria | Female | White | 40s | High school diploma or equivalent | 7.50 |
| Vienna | Female | White | 10s | Less than High School diploma | 7.25 |
| Roanoke | Female | White | 30s | Associate Degree or more | 10.42 |
| Chesterfield | Male | White | 30s | High school diploma or equivalent | 13.00 |
| Newport News | Male | White | 50s | Less than High School diploma | 12.00 |
| Chesapeake | Male | White | 40s | Some college but no degree | 28.75 |
| Alexandria | Female | White | 30s | Associate Degree or more | 13.00 |

Perturbing

Perturbing is a privacy-preserving technique that protects data by adding small amounts of random noise or modifying values in a controlled way. The goal of perturbing is to obscure individual data points, making it harder to re-identify individuals while retaining the dataset’s overall statistical properties.

In our dataset, this could involve adding or subtracting a small amount to the age variable instead of grouping. Since we have already performed generalization through grouping, we will not add perturbation to this variable.
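
Although we skip it here, a minimal sketch of what such perturbation could look like on a copy of the data (the ±2-year noise range is an arbitrary choice for illustration):

# Illustrative only: add small uniform integer noise to age on a copy of the data
set.seed(42)  # make the random noise reproducible
df_perturbed <- df2
df_perturbed$age <- df_perturbed$age + sample(-2:2, nrow(df_perturbed), replace = TRUE)
# Keep perturbed ages within the observed range of the data (15 to 85)
df_perturbed$age <- pmin(pmax(df_perturbed$age, 15), 85)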

table1::label(df4$age_group) <- "Age Group"
table1::label(df4$hourwage) <- "Hourly Wage"
table1::label(df4$education_group) <- "Education Group"
table1::label(df4$sex) <- "Sex"

table1::table1(~ age_group + hourwage + sex + education_group | race, data = df4)
|  | Black (N=26113) | White (N=184338) | Overall (N=210451) |
|---|---|---|---|
| Age Group |  |  |  |
| 10s | 1315 (5.0%) | 11807 (6.4%) | 13122 (6.2%) |
| 20s | 6001 (23.0%) | 43024 (23.3%) | 49025 (23.3%) |
| 30s | 5529 (21.2%) | 37103 (20.1%) | 42632 (20.3%) |
| 40s | 5219 (20.0%) | 33303 (18.1%) | 38522 (18.3%) |
| 50s | 4913 (18.8%) | 33679 (18.3%) | 38592 (18.3%) |
| 60s | 2531 (9.7%) | 20418 (11.1%) | 22949 (10.9%) |
| 70s | 509 (1.9%) | 4288 (2.3%) | 4797 (2.3%) |
| 80s | 96 (0.4%) | 716 (0.4%) | 812 (0.4%) |
| Hourly Wage |  |  |  |
| Mean (SD) | 17.3 (9.33) | 19.6 (11.0) | 19.3 (10.8) |
| Median [Min, Max] | 15.0 [1.00, 100] | 16.8 [0.0100, 100] | 16.5 [0.0100, 100] |
| Sex |  |  |  |
| Female | 14696 (56.3%) | 92721 (50.3%) | 107417 (51.0%) |
| Male | 11417 (43.7%) | 91617 (49.7%) | 103034 (49.0%) |
| Education Group |  |  |  |
| Associate Degree or more | 7565 (29.0%) | 63284 (34.3%) | 70849 (33.7%) |
| High school diploma or equivalent | 9985 (38.2%) | 62735 (34.0%) | 72720 (34.6%) |
| Less than High School diploma | 2434 (9.3%) | 20988 (11.4%) | 23422 (11.1%) |
| Some college but no degree | 6129 (23.5%) | 37331 (20.3%) | 43460 (20.7%) |
# Identifying distinct entries for the variables
distinct_entries <- df4 %>%
  summarise(
    age_group = n_distinct(age_group),
    education = n_distinct(education_group),
    sex = n_distinct(sex),
    race = n_distinct(race),
    County = n_distinct(County)
  )

distinct_entries
##   age_group education sex race County
## 1         8         4   2    2     52

Observations:

  • The age variable is now represented by only 8 ranges, instead of being continuous. This reduces the number of distinct values from 67 to 8.

  • The education variable now has only 4 levels, instead of 16.

  • The Zip_Code variable had 226 distinct values. After generalizing zip codes to counties, this number was reduced to 52.

In k-anonymity, each unique combination of quasi-identifiers must appear at least k times in the dataset to ensure that individuals cannot be easily re-identified. The data frame below shows each distinct set of quasi-identifiers, along with a count indicating how many records share that exact combination. This dataframe tells us that there are 5,773 possible combinations of quasi-identifiers in our dataset.

# Count the number of occurrences of each combination of quasi-identifiers
combinations <- df4 %>%
  group_by(County, sex, race, age_group, education_group) %>%
  summarise(count = n(), .groups = "drop")

# Display combinations and their counts
head(combinations, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| County | sex | race | age_group | education_group | count |
|---|---|---|---|---|---|
| Abingdon | Female | Black | 10s | High school diploma or equivalent | 2 |
| Abingdon | Female | Black | 10s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 10s | Some college but no degree | 2 |
| Abingdon | Female | Black | 20s | Associate Degree or more | 6 |
| Abingdon | Female | Black | 20s | High school diploma or equivalent | 15 |
| Abingdon | Female | Black | 20s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 20s | Some college but no degree | 8 |
| Abingdon | Female | Black | 30s | Associate Degree or more | 8 |
| Abingdon | Female | Black | 30s | High school diploma or equivalent | 11 |
| Abingdon | Female | Black | 30s | Less than High School diploma | 2 |

Out of those possible combinations, we wish to find how many do not meet k-anonymity. To do this, we filter the dataset to only the rows whose count value is less than \(k = 2\).

k = 2

# Find combinations that do not meet k-anonymity (where count < k)
unique_combinations <- combinations %>%
  filter(count < k)

head(unique_combinations, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| County | sex | race | age_group | education_group | count |
|---|---|---|---|---|---|
| Abingdon | Female | Black | 10s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 20s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 70s | Associate Degree or more | 1 |
| Abingdon | Female | Black | 70s | High school diploma or equivalent | 1 |
| Abingdon | Female | Black | 80s | High school diploma or equivalent | 1 |
| Abingdon | Female | Black | 80s | Less than High School diploma | 1 |
| Abingdon | Female | White | 70s | Less than High School diploma | 1 |
| Abingdon | Female | White | 80s | Associate Degree or more | 1 |
| Abingdon | Male | Black | 30s | Less than High School diploma | 1 |
| Abingdon | Male | Black | 50s | Some college but no degree | 1 |

After filtering the data to rows with count values less than 2, we find that 573 combinations, about 10% of the 5,773 possible combinations, do not satisfy k-anonymity. Thus, we applied additional generalization techniques in the dataset below. Instead of using 10-year age intervals, we broadened the age groups to 20-year ranges. Additionally, we reduced the geographical detail by grouping individuals into 6 broader regions (Northern Virginia, Central Virginia, Eastern Virginia, Western Virginia, Southwest Virginia, and Washington), rather than the original 52 distinct counties.

# Load dplyr
library(dplyr)

# Apply regional generalization
df5 <- df4 %>%
  mutate(Region = case_when(
    County %in% c("Alexandria", "Arlington", "Centreville", "Chantilly", "Fairfax", "Falls Church", 
                  "Herndon", "Leesburg", "Mc Lean", "Manassas", "Reston", "Sterling", 
                  "Vienna", "Warrenton", "Woodbridge") ~ "Northern Virginia",
    County %in% c("Charlottesville", "Chester", "Chesterfield", "Fredericksburg", "Glen Allen", 
                  "Henrico", "Mechanicsville", "Midlothian", "Richmond", "Spotsylvania", 
                  "Stafford") ~ "Central Virginia",
    County %in% c("Chesapeake", "Hampton", "Newport News", "Norfolk", "Portsmouth", 
                  "Suffolk", "Virginia Beach", "Williamsburg", "Yorktown") ~ "Eastern Virginia",
    County %in% c("Blacksburg", "Harrisonburg", "Lynchburg", "Winchester") ~ "Western Virginia",
    County %in% c("Abingdon", "Bristol", "Danville", "Farmville", "Radford") ~ "Southwest Virginia",
    County == "Washington" ~ "Washington",
    TRUE ~ "Unknown"
  ))

df5$County <- NULL

# Define the desired column order, with Region first and hourwage last
new_order <- c("Region", "sex", "race", "age_group", "education_group", "hourwage")


# Generalize age to 20-year intervals
df5 <- df5 %>%
  mutate(age_group = case_when(
    age_group %in% c("10s", "20s") ~ "10-29",
    age_group %in% c("30s", "40s") ~ "30-49",
    age_group %in% c("50s", "60s") ~ "50-69",
    age_group %in% c("70s", "80s") ~ "70-89",
    TRUE ~ age_group  # Catch any other categories, if needed
  ))

# Reorder the dataframe columns according to the new order
df5 <- df5[, new_order]
# Count the number of occurrences of each combination of quasi-identifiers
combinations <- df5 %>%
  group_by(Region, sex, race, age_group, education_group) %>%
  summarise(count = n(), .groups = "drop")

# Sort the combinations dataset by the 'count' column in ascending order
combinations_sorted <- combinations %>%
  arrange(count)

# Display the sorted dataset
head(combinations_sorted, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | count |
|---|---|---|---|---|---|
| Southwest Virginia | Male | Black | 70-89 | Associate Degree or more | 1 |
| Southwest Virginia | Male | Black | 70-89 | Less than High School diploma | 1 |
| Washington | Female | Black | 70-89 | High school diploma or equivalent | 1 |
| Washington | Female | Black | 70-89 | Less than High School diploma | 1 |
| Washington | Male | Black | 70-89 | High school diploma or equivalent | 1 |
| Western Virginia | Female | Black | 70-89 | High school diploma or equivalent | 1 |
| Southwest Virginia | Female | Black | 70-89 | Less than High School diploma | 2 |
| Southwest Virginia | Female | Black | 70-89 | Some college but no degree | 2 |
| Unknown | Male | Black | 70-89 | Less than High School diploma | 2 |
| Washington | Male | Black | 10-29 | Less than High School diploma | 2 |

Challenges & Limitations

Despite efforts to achieve k-anonymity through suppression and generalization in our example, the dataset still contains six unique combinations of quasi-identifiers that fail to meet the k-anonymity requirement. Each combination of Region, sex, race, age_group, and education_group should ideally occur at least \(k=2\) times to prevent re-identification of individuals. However, the first six rows above have a count of 1 (that is, \(k-1\)), indicating that these combinations are still too specific and leave individuals identifiable. This introduces the following challenges:

  • High Granularity: Our dataset exhibits high granularity, meaning that the data is broken down into highly detailed and specific combinations of quasi-identifiers. This level of detail makes it extremely challenging to group rows into equivalence classes that satisfy k-anonymity. Even after generalizing the data further by broadening age groups and simplifying location details, the variety of attributes such as region or education level still resulted in six unique combinations. These unique rows, six records out of 210,451, highlight a critical limitation: achieving perfect k-anonymity for every record can be exceptionally difficult in datasets with significant variability and diverse populations. Because of this, one could argue that the dataset achieves a practical level of k-anonymity, as the non-compliant portion is vanishingly small.

  • Imbalance and Minority Representation: Another significant challenge comes from the representation of minority groups in the dataset. For example, a substantial portion of the quasi-identifier combinations that fail to meet k-anonymity are associated with individuals of the Black race. This is because, as a minority group, there are fewer instances of Black individuals in the dataset, making it less likely for them to share the same set of quasi-identifiers as others. Consequently, these individuals are more prone to being identifiable even after generalization and suppression.

  • The Trade-Off Dilemma: Let’s consider that we are still not satisfied and want our entire dataset to fully satisfy k-anonymity. Achieving this would require even more extensive suppression and generalization. For instance, we might need to group individuals into broader age ranges spanning 30 years and simplify educational levels to just two categories: “More than a Bachelor’s Degree” and “Bachelor’s Degree or Less.” While this would reduce the number of unique equivalence classes and make it easier to satisfy k-anonymity, it comes at a cost: if too many attributes are suppressed or over-generalized, the data can become meaningless. Achieving k-anonymity therefore requires balancing data utility with privacy, as the sketch below illustrates.
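
A minimal sketch of that more aggressive generalization (the band labels and cut points are our own illustration; the exact Bachelor’s-based split described above would require the original education variable, so we collapse the existing groups instead):

# Broaden df4's 10-year age groups into roughly 30-year bands
df6 <- df4
df6$age_group <- case_when(
  df6$age_group %in% c("10s", "20s", "30s") ~ "10-39",
  df6$age_group %in% c("40s", "50s", "60s") ~ "40-69",
  TRUE                                      ~ "70-89"
)

# Collapse the four education groups into just two levels
df6$education_group <- ifelse(
  df6$education_group == "Associate Degree or more",
  "Associate Degree or more",
  "Less than an Associate Degree"
)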

l-diversity

k-Anonymity is a widely used method to protect against identity disclosure by ensuring that each record is indistinguishable from at least \(k-1\) others based on quasi-identifiers. However, it falls short in preventing attribute disclosure. Attribute disclosure occurs when an attacker can infer sensitive information about an individual even if the individual’s exact identity is protected. This situation arises particularly when all records in a k-anonymous group share the same sensitive attribute value, making it easy for attackers to draw confident inferences.

To address this shortcoming, l-diversity extends k-anonymity by adding another layer of protection. It requires that sensitive attributes within each equivalence class have at least \(l\) well-represented distinct values. This means that even if an attacker knows the quasi-identifiers of an individual, they cannot easily determine the sensitive attribute. More formally, the book defines it as follows:

Definition l-diversity: ‘A property of a dataset where for each equivalence class in the dataset, there are at least l well-represented values for the sensitive attribute.’

Example: Consider this simplified dataset that has achieved \(k=3\) anonymity:

# Create a dummy dataframe with 9 rows and 3 equivalence classes
df_dummy <- data.frame(
  Region = c(
    "Northern Virginia", "Northern Virginia", "Northern Virginia",
    "Central Virginia", "Central Virginia", "Central Virginia",
    "Eastern Virginia", "Eastern Virginia", "Eastern Virginia"
  ),
  sex = c(
    "Female", "Female", "Female",
    "Female", "Female", "Female",
    "Male", "Male", "Male"
  ),
  race = c(
    "White", "White", "White",
    "Asian", "Asian", "Asian",
    "Black", "Black", "Black"
  ),
  age_group = c(
    "30-49", "30-49", "30-49",
    "40-59", "40-59", "40-59",
    "10-29", "10-29", "10-29"
  ),
  education_group = c(
    "Bachelor's Degree", "Bachelor's Degree", "Bachelor's Degree",
    "High School", "High School", "High School",
    "Master's Degree", "Master's Degree", "Master's Degree"
  ),
  hourwage = c(
    25.00, 25.00, 25.00,
    16.75, 14.00, 17.20,
    30.10, 27.10, 30.10
  )
)

head(df_dummy, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | hourwage |
|---|---|---|---|---|---|
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 |
| Central Virginia | Female | Asian | 40-59 | High School | 16.75 |
| Central Virginia | Female | Asian | 40-59 | High School | 14.00 |
| Central Virginia | Female | Asian | 40-59 | High School | 17.20 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 30.10 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 27.10 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 30.10 |

Analysis of Diversity

In this dataset, there are three equivalence classes:

  • First Class: (Northern Virginia, White, Female, age 30-49, Bachelor’s Degree)

    • This class is not diverse; it has only one unique hourwage value of 25.00. Thus, it is l = 1 diverse. If an attacker knows that Maria is in the database and is a 37-year-old White female with a Bachelor’s degree from Northern Virginia, they can confidently infer that she earns 25 dollars an hour.
  • Second Class: (Central Virginia, Asian, Female, age 40-59, High School)

    • This class is l = 3 diverse, with three different hourwage values: 16.75, 14.00, and 17.20.
  • Third Class: (Eastern Virginia, Black, Male, age 10-29, Master’s Degree)

    • This class is l = 2 diverse, with two different hourwage values: 30.10 and 27.10.
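
We can confirm these values programmatically by counting the distinct sensitive values in each equivalence class, which is exactly the quantity \(l\) from the definition above:

# Distinct hourwage values per equivalence class (the l of each class);
# the dataset as a whole is only as diverse as its least diverse class
df_dummy %>%
  group_by(Region, sex, race, age_group, education_group) %>%
  summarise(l = n_distinct(hourwage), .groups = "drop")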

Challenges & Limitations

To achieve \(l = 2\) diversity across the entire dataset, simply generalizing or suppressing information is often insufficient. The first equivalence class lacks enough variation in the sensitive attribute (hourwage) to meet the l-diversity requirement. Addressing this would require either eliminating the class or altering the data significantly, which could compromise the utility of the dataset. This example demonstrates that while k-anonymity provides a baseline level of protection, it does not fully safeguard against attribute disclosure. Just as with k-anonymity, achieving l-diversity requires more complex strategies and often involves trade-offs between maintaining data utility and ensuring privacy.

Example: Suppose we augment our dataset to include a new sensitive attribute: Disease. This updated dataset maintains \(k = 3\) anonymity, as for every combination of quasi-identifiers, there are at least 3 records that share the same set of values. Additionally, the dataset achieves \(l = 2\) diversity because each equivalence class contains at least two distinct values for both sensitive attributes.

# Create a dummy dataframe with 9 rows and 3 equivalence classes, including the 'Disease' column
df_dummy <- data.frame(
  Region = c(
    "Northern Virginia", "Northern Virginia", "Northern Virginia",
    "Central Virginia", "Central Virginia", "Central Virginia",
    "Eastern Virginia", "Eastern Virginia", "Eastern Virginia"
  ),
  sex = c(
    "Female", "Female", "Female",
    "Female", "Female", "Female",
    "Male", "Male", "Male"
  ),
  race = c(
    "White", "White", "White",
    "Asian", "Asian", "Asian",
    "Black", "Black", "Black"
  ),
  age_group = c(
    "30-49", "30-49", "30-49",
    "40-59", "40-59", "40-59",
    "10-29", "10-29", "10-29"
  ),
  education_group = c(
    "Bachelor's Degree", "Bachelor's Degree", "Bachelor's Degree",
    "High School", "High School", "High School",
    "Master's Degree", "Master's Degree", "Master's Degree"
  ),
  hourwage = c(
    25.00, 26.00, 25.00,
    16.75, 14.00, 17.20,
    30.10, 27.10, 40.00
  ),
  Disease = c(
    "Diabetes", "Heart Disease", "Heart Disease",
    "Gastritis", "Stomach Cancer", "Gastric Ulcer",
    "Alopecia", "Diabetes", "Alopecia"
  )
)
head(df_dummy, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | hourwage | Disease |
|---|---|---|---|---|---|---|
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 | Diabetes |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 26.00 | Heart Disease |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 | Heart Disease |
| Central Virginia | Female | Asian | 40-59 | High School | 16.75 | Gastritis |
| Central Virginia | Female | Asian | 40-59 | High School | 14.00 | Stomach Cancer |
| Central Virginia | Female | Asian | 40-59 | High School | 17.20 | Gastric Ulcer |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 30.10 | Alopecia |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 27.10 | Diabetes |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 40.00 | Alopecia |
  • Attribute Disclosure:

One of the major challenges of ensuring data privacy, even after implementing l-diversity, is the risk of attribute disclosure. Consider a scenario where an attacker has some auxiliary information about an individual in the dataset. For example, suppose the attacker knows that Emily is an Asian female with a high school education. Even though the database achieves \(k=3\) anonymity and \(l=2\) diversity, the attacker can still make concerning inferences. Despite not knowing Emily’s exact hourly wage, the attacker can confidently deduce that Emily earns less than 18 dollars per hour, because within Emily’s equivalence class (Central Virginia, Asian, Female, age 40-59, High School) all wage values fall below this threshold. Furthermore, the attacker can infer that Emily has a stomach-related health issue, as the only diseases represented within this class are “Gastritis,” “Stomach Cancer,” and “Gastric Ulcer.” Because the sensitive values are distinct yet semantically similar, this is known as a similarity attack.

t-closeness

While k-anonymity and l-diversity aim to protect against identity and attribute disclosure, respectively, they may still leave datasets vulnerable to more sophisticated inference attacks. Even with l-diversity, if the distribution of sensitive attributes within an equivalence class differs greatly from the overall distribution, attackers with background knowledge can still infer sensitive information. This vulnerability is addressed through t-closeness.

Definition: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole dataset does not exceed a specified threshold \(t\).

– A table/dataset is said to have t-closeness if all equivalence classes have t-closeness.

Earth Mover’s Distance

The Earth Mover’s Distance (EMD) measures the minimum “cost” to transform one distribution into another. This cost is calculated by summing over the absolute differences of the cumulative probabilities from each distribution up to each point \(i\), normalized by the total number of elements minus one (\(m - 1\)). By calculating the EMD between the distribution of the sensitive attribute in each equivalence class and the distribution in the full dataset, we can quantitatively measure how much the anonymized data deviates from the original distribution.

For an ordered attribute, the formula for EMD is expressed as:

\[ E(P, Q) = \frac{1}{m - 1} \sum_{i=1}^m \left| \sum_{j=1}^i (p_j - q_j) \right| \]

where:

  • \(m\) is the number of elements in the distributions \(P\) and \(Q\).
  • \(p_j\) and \(q_j\) are the probabilities at position \(j\) in distributions \(P\) and \(Q\), respectively.

Setting a Threshold

Choosing an appropriate threshold \(t\) for t-closeness is crucial to ensuring the effectiveness of this privacy-preserving method. The threshold \(t\) determines the maximum allowable distance between the distribution of a sensitive attribute within each equivalence class and its distribution across the entire dataset. A smaller \(t\) value indicates a stricter privacy constraint.

Unfortunately, “[a]t this point, no universal threshold for t appears to exist.” The authors mention that “the t-closeness principle can be applied using other distance measures. While EMD is the best measure we have found so far, it is certainly not perfect. In particular, the relationship between the value t and information gain is unclear.” Because of this, it is crucial to consider the following aspects:

    1. Data Sensitivity: Highly sensitive data might necessitate a lower threshold to minimize the risk of revealing personal information through statistical inference.
    2. Data Distribution: Analyze the overall distribution of the sensitive attribute; a skewed distribution may require a stricter threshold.
    3. Regulatory Compliance: Ensure compliance with relevant data protection laws, which may influence the minimum acceptable threshold.
    4. Privacy vs. Data Utility: Balance the need for data privacy with the utility of the anonymized data. A lower threshold enhances privacy but may reduce utility.
    5. Iterative Refinement: Continuously monitor and adjust the threshold based on new data, legal changes, or advances in privacy research.

Step-by-Step Implementation

1. Calculate Overall Distribution of Sensitive Attributes:

Now that we understand the purpose of t-closeness, the first step in implementing it on our dataset is to determine the distribution of each sensitive attribute, in this case hourwage and Disease, across the entire dataset. For this example, we will focus on the numeric hourwage attribute; a similar process should be applied to the categorical Disease attribute.

overall_distribution_hourwage <- df_dummy %>%
  summarise(mean_hourwage = mean(hourwage, na.rm = TRUE),
            sd_hourwage = sd(hourwage, na.rm = TRUE))

head(overall_distribution_hourwage, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| mean_hourwage | sd_hourwage |
|---|---|
| 24.57222 | 7.942572 |
overall_distribution_disease <- df_dummy %>%
  count(Disease) %>%
  mutate(proportion = n / sum(n))

head(overall_distribution_disease, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Disease | n | proportion |
|---|---|---|
| Alopecia | 2 | 0.2222222 |
| Diabetes | 2 | 0.2222222 |
| Gastric Ulcer | 1 | 0.1111111 |
| Gastritis | 1 | 0.1111111 |
| Heart Disease | 2 | 0.2222222 |
| Stomach Cancer | 1 | 0.1111111 |

2. Calculate Distribution Within Each Equivalence Class:

For each group of records that share the same quasi-identifiers (e.g., Region, sex, race, age_group, education_group), compute the distribution of the sensitive attributes.

equivalence_classes <- df_dummy %>%
  group_by(Region, sex, race, age_group, education_group) %>%
  summarise(mean_hourwage = mean(hourwage, na.rm = TRUE),
            sd_hourwage = sd(hourwage, na.rm = TRUE),
            disease_distribution = list(table(Disease) / length(Disease)),
            .groups = "drop")

head(equivalence_classes, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | mean_hourwage | sd_hourwage | disease_distribution |
|---|---|---|---|---|---|---|---|
| Central Virginia | Female | Asian | 40-59 | High School | 15.98333 | 1.7322914 | 0.3333333, 0.3333333, 0.3333333 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 32.40000 | 6.7505555 | 0.6666667, 0.3333333 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.33333 | 0.5773503 | 0.3333333, 0.6666667 |

3. Compare Distributions Using a Distance Metric:

Use a distance metric (like the Earth Mover’s Distance or Jensen-Shannon Divergence) to quantify how different the distributions within each equivalence class are from the overall distribution. For simplicity, let’s calculate the absolute difference in means:

equivalence_classes <- equivalence_classes %>%
  mutate(hourwage_diff = abs(mean_hourwage - overall_distribution_hourwage$mean_hourwage))

head(equivalence_classes, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| Region | sex | race | age_group | education_group | mean_hourwage | sd_hourwage | disease_distribution | hourwage_diff |
|---|---|---|---|---|---|---|---|---|
| Central Virginia | Female | Asian | 40-59 | High School | 15.98333 | 1.7322914 | 0.3333333, 0.3333333, 0.3333333 | 8.5888889 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 32.40000 | 6.7505555 | 0.6666667, 0.3333333 | 7.8277778 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.33333 | 0.5773503 | 0.3333333, 0.6666667 | 0.7611111 |

4. Adjust the Data to Meet t-Closeness:

If the difference exceeds a predefined threshold \(t\), adjust the data. This might involve:

  • Smoothing the values of the sensitive attribute (a sketch follows this list).

    • For the Central Virginia equivalence class, we may increase the hourwage values slightly to bring the mean closer to 24.57. Given the low standard deviation, adjustments should be subtle to maintain the class’s internal consistency.

    • For the Eastern Virginia class, we may decrease the hourwage values to make the mean closer to 24.57, while considering the existing variability to avoid over-smoothing.

    • For the Northern Virginia class, minimal or no adjustment is needed since the hourwage mean is already close to the overall mean.

  • Grouping records further or making other data transformations to balance the distributions.
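
A minimal sketch of the smoothing idea, pulling each class’s wages a fraction of the way toward the overall mean (the shrinkage factor alpha is an arbitrary illustration, not a prescribed value):

# Shift every wage in a class so the class mean moves alpha of the way
# toward the overall mean; alpha = 1 would equalize all class means
alpha <- 0.5
df_smoothed <- df_dummy %>%
  group_by(Region, sex, race, age_group, education_group) %>%
  mutate(hourwage = hourwage +
           alpha * (overall_distribution_hourwage$mean_hourwage - mean(hourwage))) %>%
  ungroup()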

overall_mean <- overall_distribution_hourwage$mean_hourwage
t <- 0.1  # 10% threshold

equivalence_classes <- equivalence_classes %>%
  mutate(t_closeness = ifelse(hourwage_diff <= overall_mean * t, TRUE, FALSE))
# Install the transport package
#install.packages("transport")

# Load the package
library(transport)
## Warning: package 'transport' was built under R version 4.3.3
# Overall wage distribution
overall_wage <- df_dummy$hourwage

# Initialize an empty list to store results
results <- list()

# Create a unique identifier for each group based on quasi-identifiers
df_dummy$GroupID <- with(df_dummy, interaction(Region, sex, race, age_group, education_group))

# Unique groups
unique_groups <- unique(df_dummy$GroupID)

# Loop over each group to calculate EMD
for (group_id in unique_groups) {
  subgroup <- df_dummy[df_dummy$GroupID == group_id, ]
  subgroup_wage <- subgroup$hourwage
  
  # Calculate Wasserstein distance
  emd_value <- wasserstein1d(subgroup_wage, overall_wage)
  
  # Store results
  results[[as.character(group_id)]] <- emd_value
}

# Display results
results
## $`Northern Virginia.Female.White.30-49.Bachelor's Degree`
## [1] 5.25
## 
## $`Central Virginia.Female.Asian.40-59.High School`
## [1] 8.588889
## 
## $`Eastern Virginia.Male.Black.10-29.Master's Degree`
## [1] 7.827778
# Example Data
df_data <- data.frame(
  ZIP_Code = c("4767*", "4767*", "4767*", "4790*", "4790*", "4790*", "4760*", "4760*", "4760*"),
  Age = c("<= 40", "<= 40", "<= 40", ">= 40", ">= 40", ">= 40", "<= 40", "<= 40", "<= 40"),
  Salary = c("3K", "5K", "9K", "6K", "11K", "8K", "4K", "7K", "10K"),
  Disease = c("gastric ulcer", "stomach cancer", "pneumonia", "gastritis", "flu", "bronchitis", "gastritis", "bronchitis", "stomach cancer")
)

# Convert Salary to numeric
df_data$Salary <- as.numeric(sub("K", "", df_data$Salary)) * 1000
table(df_data$Salary) / nrow(df_data)
## 
##      3000      4000      5000      6000      7000      8000      9000     10000 
## 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 
##     11000 
## 0.1111111
create_auto_distributions <- function(data, attribute) {
  # Determine unique values in the specified attribute to form equivalence classes
  unique_values <- unique(data[[attribute]])
  
  # Initialize list to hold distributions
  distributions <- list()
  
  # Total number of elements
  n <- nrow(data)
  
  # Loop through each unique value to create distributions
  for (value in unique_values) {
    # Create a zero vector of length n
    distribution <- numeric(n)
    
    # Find rows where the attribute matches the current unique value
    indices <- which(data[[attribute]] == value)
    
    # Assign equal probabilities to these indices
    distribution[indices] <- 1 / length(indices)
    
    # Name the distribution based on the unique value
    distributions[[paste(attribute, value, sep = "_")]] <- distribution
  }
  
  # Return the list of distributions
  return(distributions)
}

create_auto_distributions(df_data, "Salary")
## $Salary_3000
## [1] 1 0 0 0 0 0 0 0 0
## 
## $Salary_5000
## [1] 0 1 0 0 0 0 0 0 0
## 
## $Salary_9000
## [1] 0 0 1 0 0 0 0 0 0
## 
## $Salary_6000
## [1] 0 0 0 1 0 0 0 0 0
## 
## $Salary_11000
## [1] 0 0 0 0 1 0 0 0 0
## 
## $Salary_8000
## [1] 0 0 0 0 0 1 0 0 0
## 
## $Salary_4000
## [1] 0 0 0 0 0 0 1 0 0
## 
## $Salary_7000
## [1] 0 0 0 0 0 0 0 1 0
## 
## $Salary_10000
## [1] 0 0 0 0 0 0 0 0 1
# Define the overall (uniform) distribution and the three equivalence-class
# distributions over the nine records
Q3  <- rep(1/9, 9)
P31 <- c(1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0)
P32 <- c(0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3)
P33 <- c(0, 0, 0, 0, 1/3, 0, 1/3, 1/3, 0)

# Calculate EMD using the defined formula
emd_function <- function(P, Q) {
  m <- length(P)  # number of positions in the distributions
  abs_cumulative_diffs <- abs(cumsum(P - Q))  # absolute cumulative differences
  emd <- sum(abs_cumulative_diffs) / (m - 1)  # EMD calculation
  return(emd)
}

# Calculate EMD for P31 vs. Q3
emd_P31 <- emd_function(P31, Q3)
emd_P32 <- emd_function(P32, Q3)
emd_P33 <- emd_function(P33, Q3)

print(sprintf("EMD P3,1 vs. Q3: %f", emd_P31))
## [1] "EMD P3,1 vs. Q3: 0.375000"
print(sprintf("EMD P3,2 vs. Q3: %f", emd_P32))
## [1] "EMD P3,2 vs. Q3: 0.166667"
print(sprintf("EMD P3,3 vs. Q3: %f", emd_P33))
## [1] "EMD P3,3 vs. Q3: 0.236111"
# A step-by-step view of the same computation for P3,1 vs. Q3
P31 <- c(1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0)
Q3 <- rep(1/9, 9)

# Cumulative differences: entry i is the net mass that must cross the gap
# between positions i and i + 1
cum_diff <- cumsum(P31 - Q3)

# Moving one unit of mass across one gap costs one unit of work, so the total
# work is the sum of the absolute cumulative differences (the last entry is
# zero because both distributions sum to 1)
total_work <- sum(abs(cum_diff[-length(cum_diff)]))

# Normalize by m - 1, as in the EMD formula above
emd_value <- total_work / (length(P31) - 1)

cat("EMD for P3,1 vs. Q3:", emd_value, "\n")
## EMD for P3,1 vs. Q3: 0.375

Fair Relabelling

Let’s now turn to the discrimination aspect of fairness. How can we measure that a historical dataset contains a bias against a sensitive group, and how can we change the dataset in a preprocessing step so as to remove this measured bias from the dataset?

Measuring Fairness of a Dataset

  • Positive outcome: An outcome we want to balance across groups, such as earning at or above the mean hourly wage.

  • Sensitive group (S): Group that might face discrimination (e.g., Black).

  • Non-sensitive group (NS): Group that may have an advantage (e.g., White).

Step-by-step Implementation

For this example, let’s assume df2 contains the data with columns race and hourwage.

# Grab the race and hourwage columns from df2
df_race_outcome <- df2 %>%
  select(race, hourwage)

Step 1: Create the outcome variable Y based on the mean hourwage

# Calculate the mean hourwage to use as the cutoff for a positive outcome
mean_hourwage <- mean(df_race_outcome$hourwage, na.rm = TRUE)

# Create the outcome variable Y: 1 if hourwage >= mean_hourwage, 0 otherwise
df_race_outcome <- df_race_outcome %>%
  mutate(Y = ifelse(hourwage >= mean_hourwage, 1, 0))

# View the resulting dataframe
head(df_race_outcome, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
| race | hourwage | Y |
|---|---|---|
| White | 11.49 | 0 |
| White | 14.50 | 0 |
| White | 9.00 | 0 |
| White | 7.50 | 0 |
| White | 7.25 | 0 |
| White | 10.42 | 0 |
| White | 13.00 | 0 |
| White | 12.00 | 0 |
| White | 28.75 | 1 |
| White | 13.00 | 0 |

Step 2: Calculate the number of positive outcomes for non-sensitive (White) and sensitive (Black) groups

ns_positive <- df_race_outcome %>% filter(race == "White" & Y == 1) %>% nrow()  # Count of positive outcomes for 'White'
s_positive <- df_race_outcome %>% filter(race == "Black" & Y == 1) %>% nrow()   # Count of positive outcomes for 'Black'

# Print the results
cat("Number of positive outcomes for White:", ns_positive, "\n")
## Number of positive outcomes for White: 69361
cat("Number of positive outcomes for Black:", s_positive, "\n")
## Number of positive outcomes for Black: 7142

Step 3: Calculate the total number of instances for non-sensitive (White) and sensitive (Black) groups

ns_total <- df_race_outcome %>% filter(race == "White") %>% nrow()  # Total instances for 'White'
s_total <- df_race_outcome %>% filter(race == "Black") %>% nrow()   # Total instances for 'Black'

# Print the values
cat("Total instances for White group (ns_total):", ns_total, "\n")
## Total instances for White group (ns_total): 184338
cat("Total instances for Black group (s_total):", s_total, "\n")
## Total instances for Black group (s_total): 26113

Step 4: Calculate the probabilities

P_ns_positive <- ns_positive / ns_total  # Probability of positive outcome for 'ns'
P_s_positive <- s_positive / s_total      # Probability of positive outcome for 's'
# Print the probabilities
cat("Probability of positive outcome for White (P_ns_positive):", P_ns_positive, "\n")
## Probability of positive outcome for White (P_ns_positive): 0.3762708
cat("Probability of positive outcome for Black (P_s_positive):", P_s_positive, "\n")
## Probability of positive outcome for Black (P_s_positive): 0.2735036

Step 5: Calculate Statistical Parity and Disparate Impact

statistical_parity <- P_ns_positive - P_s_positive  # Absolute difference in probabilities
disparate_impact <- P_ns_positive / P_s_positive    # Ratio of probabilities

cat("Statistical Parity:", statistical_parity, "\n")
## Statistical Parity: 0.1027671
cat("Disparate Impact:", disparate_impact, "\n")
## Disparate Impact: 1.375743
  • Statistical Parity measures the absolute difference in the probability of a positive outcome between the non-sensitive (White) and sensitive (Black) groups. Here, a value of 0.1028 means that White individuals have about a 10.3-percentage-point higher probability of receiving a positive outcome than Black individuals.

  • Disparate Impact measures the ratio of the probability of a positive outcome between the two groups. A value of 1.376 indicates that White individuals are about 1.38 times more likely to receive a positive outcome than Black individuals.

These results suggest that the dataset may contain a bias favoring White individuals over Black individuals for positive outcomes. However, disparate impact on its own does not account for other factors that might explain the gap, so further examination of the data is needed to assess fairness more accurately.

Massaging

Massaging involves relabeling a specific number of instances to balance outcomes across groups.

To calculate the number of label changes \(M\) required to balance your dataset, you can use the formula:

\[ M = \frac{\text{disc}(D) \cdot s_T \cdot ns_T}{s_T + ns_T} \]

where:

  • disc(D): is the measured discrimination in the dataset (Statistical Parity).
  • s_T: is the total count of individuals in the sensitive group.
  • ns_T: is the total count of individuals in the non-sensitive group.

Let’s substitute our calculated values into this formula:

  • disc(D) (Statistical Parity): 0.1028
  • s_T (Total for Black group): 26,113
  • ns_T (Total for White group): 184,338

\[ M = \frac{0.1028 \times 26113 \times 184338}{26113 + 184338} \]

The result for \(M\) gives the number of label changes needed to remove the measured bias: \(M\) labels are flipped from negative to positive in the sensitive group, and \(M\) labels from positive to negative in the non-sensitive group.

M <- (statistical_parity * s_total * ns_total) / (s_total + ns_total)

# Print the result for M
cat("Number of label changes required (M):", M, "\n")
## Number of label changes required (M): 2350.579
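
Finally, a minimal sketch of carrying out the relabeling itself. The standard massaging algorithm uses a learned ranker to choose which records to flip; as a simplification we flip the records closest to the wage threshold, promoting roughly \(M\) sensitive-group individuals and demoting the same number of non-sensitive-group individuals:

M_round <- round(M)

# Distance from the threshold used to define Y; flipping the closest records
# changes the labels as little as possible
df_massaged <- df_race_outcome %>%
  mutate(row_id = row_number(),
         dist = abs(hourwage - mean_hourwage))

# Promote the M Black individuals with Y = 0 nearest the threshold
promote_ids <- df_massaged %>%
  filter(race == "Black", Y == 0) %>%
  arrange(dist) %>%
  slice_head(n = M_round) %>%
  pull(row_id)

# Demote the M White individuals with Y = 1 nearest the threshold
demote_ids <- df_massaged %>%
  filter(race == "White", Y == 1) %>%
  arrange(dist) %>%
  slice_head(n = M_round) %>%
  pull(row_id)

df_massaged <- df_massaged %>%
  mutate(Y = case_when(row_id %in% promote_ids ~ 1,
                       row_id %in% demote_ids ~ 0,
                       TRUE ~ Y))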