df1 = read.csv("C:\\Users\\luzya\\OneDrive\\Escritorio\\PhD program\\Fall 2024\\DS 8998 - Farhana\\df_ex2.csv")
df2 = read.csv("C:\\Users\\luzya\\OneDrive\\Escritorio\\PhD program\\Fall 2024\\DS 8998 - Farhana\\df_ex1.csv")
head(df2, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| id | Zip_Code | sex | race | age | education | hourwage |
|---|---|---|---|---|---|---|
| ID0008 | 23221 | Female | White | 30 | Associate’s degree, occupational/vocational program | 11.49 |
| ID0009 | 23460 | Male | White | 33 | Associate’s degree, occupational/vocational program | 14.50 |
| ID0077 | 22153 | Male | White | 60 | High school diploma or equivalent | 9.00 |
| ID0080 | 22303 | Female | White | 41 | High school diploma or equivalent | 7.50 |
| ID0081 | 22181 | Female | White | 17 | Grade 11 | 7.25 |
| ID0083 | 24015 | Female | White | 31 | Associate’s degree, academic program | 10.42 |
| ID0084 | 23832 | Male | White | 31 | High school diploma or equivalent | 13.00 |
| ID0088 | 23603 | Male | White | 53 | Grade 11 | 12.00 |
| ID0089 | 23323 | Male | White | 45 | Some college but no degree | 28.75 |
| ID0090 | 22303 | Female | White | 38 | Professional school degree | 13.00 |
# Identifying distinct entries for the variables
distinct_entries <- df2 %>%
summarise(
age = n_distinct(age),
education = n_distinct(education),
sex = n_distinct(sex),
race = n_distinct(race),
Zip_Code = n_distinct(Zip_Code)
)
distinct_entries
## age education sex race Zip_Code
## 1 67 16 2 2 226
ggplot(df2,aes(x=hourwage)) +
geom_histogram(color="Black",fill="skyblue")+
labs(title = 'Histogram of Hourly Wage', x = 'Hourly Wage', y = 'Frequency')+
theme(plot.title = element_text(hjust=0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
race_counts <- table(df2$race)
# Calculate percentages
race_percentages <- prop.table(race_counts) * 100
# Combine counts and percentages into a data frame for better readability
race_table <- data.frame(
Count = race_counts,
Percentage = sprintf("%.2f%%", race_percentages) # Format as percentage
)
print("Table for Race Distribution:")
## [1] "Table for Race Distribution:"
print(race_table)
## Count.Var1 Count.Freq Percentage
## 1 Black 26113 12.41%
## 2 White 184338 87.59%
library(table1)
table1::label(df2$age) <- "Age"
table1::label(df2$hourwage) <- "Hourly Wage"
table1::label(df2$education) <- "Education Level"
table1::label(df2$sex) <- "Sex"
table1::table1(~ age + hourwage + sex + education | race, data = df2)
|  | Black (N=26113) | White (N=184338) | Overall (N=210451) |
|---|---|---|---|
| Age | |||
| Mean (SD) | 40.8 (14.7) | 40.8 (15.4) | 40.8 (15.3) |
| Median [Min, Max] | 40.0 [15.0, 85.0] | 40.0 [15.0, 85.0] | 40.0 [15.0, 85.0] |
| Hourly Wage | |||
| Mean (SD) | 17.3 (9.33) | 19.6 (11.0) | 19.3 (10.8) |
| Median [Min, Max] | 15.0 [1.00, 100] | 16.8 [0.0100, 100] | 16.5 [0.0100, 100] |
| Sex | |||
| Female | 14696 (56.3%) | 92721 (50.3%) | 107417 (51.0%) |
| Male | 11417 (43.7%) | 91617 (49.7%) | 103034 (49.0%) |
| Education Level | |||
| 12th grade, no diploma | 515 (2.0%) | 2716 (1.5%) | 3231 (1.5%) |
| Associate's degree, academic program | 1784 (6.8%) | 13227 (7.2%) | 15011 (7.1%) |
| Associate's degree, occupational/vocational program | 1244 (4.8%) | 10943 (5.9%) | 12187 (5.8%) |
| Bachelor's degree | 3296 (12.6%) | 30087 (16.3%) | 33383 (15.9%) |
| Doctorate degree | 130 (0.5%) | 988 (0.5%) | 1118 (0.5%) |
| Grade 10 | 439 (1.7%) | 3959 (2.1%) | 4398 (2.1%) |
| Grade 11 | 870 (3.3%) | 5637 (3.1%) | 6507 (3.1%) |
| Grade 9 | 240 (0.9%) | 2986 (1.6%) | 3226 (1.5%) |
| Grades 1, 2, 3, or 4 | 67 (0.3%) | 955 (0.5%) | 1022 (0.5%) |
| Grades 5 or 6 | 93 (0.4%) | 2192 (1.2%) | 2285 (1.1%) |
| Grades 7 or 8 | 163 (0.6%) | 2172 (1.2%) | 2335 (1.1%) |
| High school diploma or equivalent | 9985 (38.2%) | 62735 (34.0%) | 72720 (34.6%) |
| Master's degree | 1015 (3.9%) | 7264 (3.9%) | 8279 (3.9%) |
| None or preschool | 47 (0.2%) | 371 (0.2%) | 418 (0.2%) |
| Professional school degree | 96 (0.4%) | 775 (0.4%) | 871 (0.4%) |
| Some college but no degree | 6129 (23.5%) | 37331 (20.3%) | 43460 (20.7%) |
K-anonymity ensures that each individual in a dataset cannot be distinguished from at least k-1 other individuals with respect to the quasi-identifiers in the dataset. This is typically achieved through generalization (grouping values into broader categories) and suppression (removing values entirely). Applying k-anonymity makes it more difficult for an attacker to re-identify specific individuals in the dataset.
Vocabulary
Identifiers (also known as key attributes): direct identifiers such as names, student numbers, email addresses, etc. These variables should in principle not be collected at all, or removed from the dataset if they are not necessary for your research project.
Quasi-identifiers: indirect identifiers that can lead to identification when combined with other quasi-identifiers in the dataset or external information. These are often demographic variables like age, sex, place of residence, etc., but could also be something entirely different like physical characteristics, timestamps, etc. In general, quasi-identifiers are usually variables that are likely to be known to someone in the outside world.
Sensitive attributes: the variables of interest that should be protected but cannot be altered, because they are the main outcome variables. Examples include Medical condition in a healthcare dataset or Income in a financial dataset.
To make a dataset k-anonymous, you must first correctly categorise every variable as an identifier, a quasi-identifier, or a sensitive attribute, because this classification determines how the dataset will be de-identified.
In the example above, we have:
Identifiers: id
Quasi-identifiers: Zip_Code, age, sex, race, and education
Sensitive attribute: hourwage
Next, you should set a value for k. If we choose k = 2, every row in the dataset must share its combination of quasi-identifiers with at least one other row. Finally, you aggregate the dataset so that every combination of quasi-identifiers occurs at least k times; a quick way to check this is sketched below.
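As a quick sanity check (not part of the original workflow), a hypothetical helper like the one below counts how often each combination of quasi-identifiers occurs and reports whether the minimum count reaches k. This sketch assumes dplyr is loaded; the function name is ours.
# Hypothetical helper: TRUE if every combination of the given quasi-identifiers
# appears at least k times in the data
is_k_anonymous <- function(data, quasi_identifiers, k) {
group_counts <- data %>%
group_by(across(all_of(quasi_identifiers))) %>%
summarise(n = n(), .groups = "drop")
min(group_counts$n) >= k
}
# Example call on the raw data (before any generalization):
# is_k_anonymous(df2, c("Zip_Code", "age", "sex", "race", "education"), k = 2)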
# Applying suppression by removing unique identifier
df2$id <- NULL
head(df2, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| Zip_Code | sex | race | age | education | hourwage |
|---|---|---|---|---|---|
| 23221 | Female | White | 30 | Associate’s degree, occupational/vocational program | 11.49 |
| 23460 | Male | White | 33 | Associate’s degree, occupational/vocational program | 14.50 |
| 22153 | Male | White | 60 | High school diploma or equivalent | 9.00 |
| 22303 | Female | White | 41 | High school diploma or equivalent | 7.50 |
| 22181 | Female | White | 17 | Grade 11 | 7.25 |
| 24015 | Female | White | 31 | Associate’s degree, academic program | 10.42 |
| 23832 | Male | White | 31 | High school diploma or equivalent | 13.00 |
| 23603 | Male | White | 53 | Grade 11 | 12.00 |
| 23323 | Male | White | 45 | Some college but no degree | 28.75 |
| 22303 | Female | White | 38 | Professional school degree | 13.00 |
df3 = df1
df3$age_group <- cut(df1$age, breaks=c(10, 20, 30, 40, 50, 60, 70, 80, 90), right=FALSE, labels=c("10s", "20s", "30s", "40s", "50s", "60s", "70s", "80s"))
df3$age <- NULL
df3$Zip_Code <- NULL
df3$id <- NULL
# Specify the desired column order for the generalized dataframe
new_order <- c("County", "sex", "race", "age_group", "education", "hourwage")
# Reorder the dataframe columns according to the new order
df3 <- df3[, new_order]
# Apply this method after modelling (otherwise the task becomes classification)
#max_wage <- ceiling(max(df1$hourwage) / 5) * 5 # This rounds up to the nearest 5
#df1$wage_group <- cut(df1$hourwage, breaks=seq(0, max_wage, by=5), include.lowest=TRUE)
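The next chunk displays a data frame df4, whose construction is not shown above. A plausible reconstruction of that step, based on the four education_group levels that appear in the table below, is sketched here; the exact recoding is our assumption.
# Hypothetical reconstruction of df4 (the chunk that produced it is not shown):
# collapse education into the four education_group levels used below
df4 <- df3 %>%
mutate(education_group = case_when(
education %in% c("None or preschool", "Grades 1, 2, 3, or 4", "Grades 5 or 6",
"Grades 7 or 8", "Grade 9", "Grade 10", "Grade 11",
"12th grade, no diploma") ~ "Less than High School diploma",
education == "High school diploma or equivalent" ~ "High school diploma or equivalent",
education == "Some college but no degree" ~ "Some college but no degree",
TRUE ~ "Associate Degree or more"
)) %>%
select(County, sex, race, age_group, education_group, hourwage)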
head(df4, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| County | sex | race | age_group | education_group | hourwage |
|---|---|---|---|---|---|
| Richmond | Female | White | 30s | Associate Degree or more | 11.49 |
| Virginia Beach | Male | White | 30s | Associate Degree or more | 14.50 |
| Springfield | Male | White | 60s | High school diploma or equivalent | 9.00 |
| Alexandria | Female | White | 40s | High school diploma or equivalent | 7.50 |
| Vienna | Female | White | 10s | Less than High School diploma | 7.25 |
| Roanoke | Female | White | 30s | Associate Degree or more | 10.42 |
| Chesterfield | Male | White | 30s | High school diploma or equivalent | 13.00 |
| Newport News | Male | White | 50s | Less than High School diploma | 12.00 |
| Chesapeake | Male | White | 40s | Some college but no degree | 28.75 |
| Alexandria | Female | White | 30s | Associate Degree or more | 13.00 |
Perturbing is a privacy-preserving technique that protects data by adding small amounts of random noise or modifying values in a controlled way. The goal of perturbing is to obscure individual data points, making it harder to re-identify individuals while retaining the dataset’s overall statistical properties.
In our dataset, this could involve adding or subtracting a small amount to the age variable instead of grouping. Since we have already performed generalization through grouping, we will not add perturbation to this variable.
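For illustration only, a minimal sketch of what such perturbation could look like on the raw age values is shown below; the plus/minus 2-year noise range and the clamping bounds are arbitrary choices of ours, and dplyr is assumed to be loaded.
set.seed(123) # make the random noise reproducible
# Hypothetical perturbation: add small integer noise between -2 and +2 years,
# then clamp to the observed age range (15 to 85)
df_perturbed <- df2 %>%
mutate(age_perturbed = pmin(pmax(age + sample(-2:2, n(), replace = TRUE), 15), 85))
# The overall age distribution stays close to the original
summary(df2$age)
summary(df_perturbed$age_perturbed)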
table1::label(df4$age_group) <- "Age Group"
table1::label(df4$hourwage) <- "Hourly Wage"
table1::label(df4$education_group) <- "Education Group"
table1::label(df4$sex) <- "Sex"
table1::table1(~ age_group + hourwage + sex + education_group | race, data = df4)
|  | Black (N=26113) | White (N=184338) | Overall (N=210451) |
|---|---|---|---|
| Age Group | |||
| 10s | 1315 (5.0%) | 11807 (6.4%) | 13122 (6.2%) |
| 20s | 6001 (23.0%) | 43024 (23.3%) | 49025 (23.3%) |
| 30s | 5529 (21.2%) | 37103 (20.1%) | 42632 (20.3%) |
| 40s | 5219 (20.0%) | 33303 (18.1%) | 38522 (18.3%) |
| 50s | 4913 (18.8%) | 33679 (18.3%) | 38592 (18.3%) |
| 60s | 2531 (9.7%) | 20418 (11.1%) | 22949 (10.9%) |
| 70s | 509 (1.9%) | 4288 (2.3%) | 4797 (2.3%) |
| 80s | 96 (0.4%) | 716 (0.4%) | 812 (0.4%) |
| Hourly Wage | |||
| Mean (SD) | 17.3 (9.33) | 19.6 (11.0) | 19.3 (10.8) |
| Median [Min, Max] | 15.0 [1.00, 100] | 16.8 [0.0100, 100] | 16.5 [0.0100, 100] |
| Sex | |||
| Female | 14696 (56.3%) | 92721 (50.3%) | 107417 (51.0%) |
| Male | 11417 (43.7%) | 91617 (49.7%) | 103034 (49.0%) |
| Education Group | |||
| Associate Degree or more | 7565 (29.0%) | 63284 (34.3%) | 70849 (33.7%) |
| High school diploma or equivalent | 9985 (38.2%) | 62735 (34.0%) | 72720 (34.6%) |
| Less than High School diploma | 2434 (9.3%) | 20988 (11.4%) | 23422 (11.1%) |
| Some college but no degree | 6129 (23.5%) | 37331 (20.3%) | 43460 (20.7%) |
# Identifying distinct entries for the variables
distinct_entries <- df4 %>%
summarise(
age_group = n_distinct(age_group),
education = n_distinct(education_group),
sex = n_distinct(sex),
race = n_distinct(race),
County = n_distinct(County)
)
distinct_entries
## age_group education sex race County
## 1 8 4 2 2 52
Observations:
The age variable is now represented by only 8 ranges instead of being continuous. This reduces the number of distinct age values from 67 to 8.
The education variable now has only 4 levels, instead of 16.
The Zip_Code variable had 226 distinct values; after generalizing locations to counties, the County variable has only 52 distinct values.
In k-anonymity, each unique combination of quasi-identifiers must appear at least k times in the dataset to ensure that individuals cannot be easily re-identified. The data frame below shows each distinct set of quasi-identifiers, along with a count indicating how many records share that exact combination. This dataframe tells us that there are 5,773 possible combinations of quasi-identifiers in our dataset.
# Count the number of occurrences of each combination of quasi-identifiers
combinations <- df4 %>%
group_by(County, sex, race, age_group, education_group) %>%
summarise(count = n(), .groups = "drop")
# Display combinations and their counts
head(combinations, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| County | sex | race | age_group | education_group | count |
|---|---|---|---|---|---|
| Abingdon | Female | Black | 10s | High school diploma or equivalent | 2 |
| Abingdon | Female | Black | 10s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 10s | Some college but no degree | 2 |
| Abingdon | Female | Black | 20s | Associate Degree or more | 6 |
| Abingdon | Female | Black | 20s | High school diploma or equivalent | 15 |
| Abingdon | Female | Black | 20s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 20s | Some college but no degree | 8 |
| Abingdon | Female | Black | 30s | Associate Degree or more | 8 |
| Abingdon | Female | Black | 30s | High school diploma or equivalent | 11 |
| Abingdon | Female | Black | 30s | Less than High School diploma | 2 |
Out of those possible combinations, we want to find how many do not meet k-anonymity. To do this, we filter the combinations to keep only the rows whose count is less than \(k=2\).
k = 2
# Find combinations that do not meet k-anonymity (where count < k)
unique_combinations <- combinations %>%
filter(count < k)
head(unique_combinations, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| County | sex | race | age_group | education_group | count |
|---|---|---|---|---|---|
| Abingdon | Female | Black | 10s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 20s | Less than High School diploma | 1 |
| Abingdon | Female | Black | 70s | Associate Degree or more | 1 |
| Abingdon | Female | Black | 70s | High school diploma or equivalent | 1 |
| Abingdon | Female | Black | 80s | High school diploma or equivalent | 1 |
| Abingdon | Female | Black | 80s | Less than High School diploma | 1 |
| Abingdon | Female | White | 70s | Less than High School diploma | 1 |
| Abingdon | Female | White | 80s | Associate Degree or more | 1 |
| Abingdon | Male | Black | 30s | Less than High School diploma | 1 |
| Abingdon | Male | Black | 50s | Some college but no degree | 1 |
After filtering to keep only rows with counts less than 2, we find that 573 combinations, or about 10% of all possible combinations, do not satisfy k-anonymity. Thus, we applied additional generalization techniques in the dataset below. Instead of using 10-year age intervals, we broadened the age groups to 20-year ranges. Additionally, we reduced the geographical detail by grouping individuals into 6 broader regions (Northern Virginia, Central Virginia, Eastern Virginia, Western Virginia, Southwest Virginia, and Washington), rather than the original 52 distinct counties.
# Load dplyr
library(dplyr)
# Apply regional generalization
df5 <- df4 %>%
mutate(Region = case_when(
County %in% c("Alexandria", "Arlington", "Centreville", "Chantilly", "Fairfax", "Falls Church",
"Herndon", "Leesburg", "Mc Lean", "Manassas", "Reston", "Sterling",
"Vienna", "Warrenton", "Woodbridge") ~ "Northern Virginia",
County %in% c("Charlottesville", "Chester", "Chesterfield", "Fredericksburg", "Glen Allen",
"Henrico", "Mechanicsville", "Midlothian", "Richmond", "Spotsylvania",
"Stafford") ~ "Central Virginia",
County %in% c("Chesapeake", "Hampton", "Newport News", "Norfolk", "Portsmouth",
"Suffolk", "Virginia Beach", "Williamsburg", "Yorktown") ~ "Eastern Virginia",
County %in% c("Blacksburg", "Harrisonburg", "Lynchburg", "Winchester") ~ "Western Virginia",
County %in% c("Abingdon", "Bristol", "Danville", "Farmville", "Radford") ~ "Southwest Virginia",
County == "Washington" ~ "Washington",
TRUE ~ "Unknown"
))
df5$County <- NULL
# Specify the desired column order for the regionalized dataframe
new_order <- c("Region", "sex", "race", "age_group", "education_group", "hourwage")
# Generalize age to 20-year intervals
df5 <- df5 %>%
mutate(age_group = case_when(
age_group %in% c("10s", "20s") ~ "10-29",
age_group %in% c("30s", "40s") ~ "30-49",
age_group %in% c("50s", "60s") ~ "50-69",
age_group %in% c("70s", "80s") ~ "70-89",
TRUE ~ age_group # Catch any other categories, if needed
))
# Reorder the dataframe columns according to the new order
df5 <- df5[, new_order]
# Count the number of occurrences of each combination of quasi-identifiers
combinations <- df5 %>%
group_by(Region, sex, race, age_group, education_group) %>%
summarise(count = n(), .groups = "drop")
# Sort the combinations dataset by the 'count' column in ascending order
combinations_sorted <- combinations %>%
arrange(count)
# Display the sorted dataset
head(combinations_sorted, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| Region | sex | race | age_group | education_group | count |
|---|---|---|---|---|---|
| Southwest Virginia | Male | Black | 70-89 | Associate Degree or more | 1 |
| Southwest Virginia | Male | Black | 70-89 | Less than High School diploma | 1 |
| Washington | Female | Black | 70-89 | High school diploma or equivalent | 1 |
| Washington | Female | Black | 70-89 | Less than High School diploma | 1 |
| Washington | Male | Black | 70-89 | High school diploma or equivalent | 1 |
| Western Virginia | Female | Black | 70-89 | High school diploma or equivalent | 1 |
| Southwest Virginia | Female | Black | 70-89 | Less than High School diploma | 2 |
| Southwest Virginia | Female | Black | 70-89 | Some college but no degree | 2 |
| Unknown | Male | Black | 70-89 | Less than High School diploma | 2 |
| Washington | Male | Black | 10-29 | Less than High School diploma | 2 |
Despite our efforts to achieve k-anonymity through suppression and generalization, the dataset still contains six unique combinations of quasi-identifiers that fail to meet the k-anonymity requirement. Each combination of Region, sex, race, age_group, and education_group should occur at least \(k=2\) times to prevent re-identification of individuals. However, the first six rows above have a count of 1, that is \(k-1\), indicating that these combinations are still specific enough to make individuals identifiable. This introduces the following challenges:
High Granularity: Our dataset exhibits high granularity, meaning that the data is broken down into highly detailed and specific combinations of quasi-identifiers. This level of detail makes it extremely challenging to group rows into equivalence classes that satisfy k-anonymity. Even after generalizing the data further by broadening age groups and simplifying location details, the variety of attributes such as region or education level still resulted in six unique combinations. These unique rows, which constitute only 0.1% of the data, highlight a critical limitation: achieving perfect k-anonymity for every record can be exceptionally difficult in datasets with significant variability and diverse populations. Because of this, one could argue that the dataset achieves a practical level of k-anonymity, as the non-compliant portion is quite small.
Imbalance and Minority Representation: Another significant challenge comes from the representation of minority groups in the dataset. For example, a substantial portion of the quasi-identifier combinations that fail to meet k-anonymity are associated with individuals of the Black race. This is because, as a minority group, there are fewer instances of Black individuals in the dataset, making it less likely for them to share the same set of quasi-identifiers as others. Consequently, these individuals are more prone to being identifiable even after generalization and suppression.
The Trade-Off Dilemma: Let’s consider that we are still not satisfied and want our entire dataset to fully satisfy k-anonymity. Achieving this would require even more extensive suppression and generalization. For instance, we might need to group individuals into broader age ranges spanning 30 years and simplify educational levels to just two categories: “More than a Bachelor’s Degree” and “Bachelor’s Degree or Less.” While this would reduce the number of unique equivalence classes and make it easier to satisfy k-anonymity, it comes at a cost: if too many attributes are generalized or suppressed, the data can become meaningless. Therefore, achieving k-anonymity requires balancing data utility with privacy.
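As a rough sketch of what that more aggressive generalization could look like, the code below rebuilds the groups from the raw variables in df1; the 30-year age bands and the two-level education split are assumptions on our part, not a prescribed mapping.
# Hypothetical, more aggressive generalization: 30-year age bands and a
# two-level education variable
df_coarse <- df1 %>%
mutate(
age_band = cut(age, breaks = c(10, 40, 70, 100), right = FALSE,
labels = c("10-39", "40-69", "70-99")),
education_level = ifelse(
education %in% c("Master's degree", "Doctorate degree",
"Professional school degree"),
"More than a Bachelor's Degree", "Bachelor's Degree or Less"
)
)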
k-Anonymity is a widely used method to protect against identity disclosure by ensuring that each record is indistinguishable from at least \(k-1\) others based on quasi-identifiers. However, it falls short in preventing attribute disclosure. Attribute disclosure occurs when an attacker can infer sensitive information about an individual even if the individual’s exact identity is protected. This situation arises particularly when all records in a k-anonymous group share the same sensitive attribute value, making it easy for attackers to draw confident inferences.
To address this shortcoming, l-diversity extends k-anonymity by adding another layer of protection. It requires that sensitive attributes within each equivalence class have at least \(l\) well-represented distinct values. This means that even if an attacker knows the quasi-identifiers of an individual, they cannot easily determine the sensitive attribute. More formally, the book defines it as follows:
Definition l-diversity: ‘A property of a dataset where for each equivalence class in the dataset, there are at least l well-represented values for the sensitive attribute.’
Example: Consider this simplified dataset that has achieved \(k=3\) anonymity:
# Create a dummy dataframe with 9 rows and 3 equivalence classes
df_dummy <- data.frame(
Region = c(
"Northern Virginia", "Northern Virginia", "Northern Virginia",
"Central Virginia", "Central Virginia", "Central Virginia",
"Eastern Virginia", "Eastern Virginia", "Eastern Virginia"
),
sex = c(
"Female", "Female", "Female",
"Female", "Female", "Female",
"Male", "Male", "Male"
),
race = c(
"White", "White", "White",
"Asian", "Asian", "Asian",
"Black", "Black", "Black"
),
age_group = c(
"30-49", "30-49", "30-49",
"40-59", "40-59", "40-59",
"10-29", "10-29", "10-29"
),
education_group = c(
"Bachelor's Degree", "Bachelor's Degree", "Bachelor's Degree",
"High School", "High School", "High School",
"Master's Degree", "Master's Degree", "Master's Degree"
),
hourwage = c(
25.00, 25.00, 25.00,
16.75, 14.00, 17.20,
30.10, 27.10, 30.10
)
)
head(df_dummy, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| Region | sex | race | age_group | education_group | hourwage |
|---|---|---|---|---|---|
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 |
| Central Virginia | Female | Asian | 40-59 | High School | 16.75 |
| Central Virginia | Female | Asian | 40-59 | High School | 14.00 |
| Central Virginia | Female | Asian | 40-59 | High School | 17.20 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 30.10 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 27.10 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 30.10 |
Analysis of Diversity
In this dataset, there are three equivalence classes:
First Class: (Northern Virginia, White, Female, age 30-49, Bachelor’s Degree) has a single hourwage value of 25.00. Thus, it is only l = 1 diverse. If an attacker knows that Maria is in the database and is a 37-year-old White female with a Bachelor’s degree from Northern Virginia, they can confidently infer that she earns 25 dollars an hour.
Second Class: (Central Virginia, Asian, Female, age 40-59, High School) has three distinct hourwage values: 16.75, 14.00, and 17.20.
Third Class: (Eastern Virginia, Black, Male, age 10-29, Master’s Degree) has two distinct hourwage values: 30.10 and 27.10.
To achieve \(l = 2\) diversity across the entire dataset, simply generalizing or suppressing information is often insufficient. The first equivalence class lacks enough variation in the sensitive attribute (hourwage) to meet the l-diversity requirement. Addressing this would require either eliminating the class or altering the data significantly, which could compromise the utility of the dataset. This example demonstrates that while k-anonymity provides a baseline level of protection, it does not fully safeguard against attribute disclosure. Just as with k-anonymity, achieving l-diversity requires more complex strategies and often involves trade-offs between maintaining data utility and ensuring privacy.
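The class-by-class diversity can also be computed directly. A minimal sketch, assuming dplyr is loaded, counts the distinct values of the sensitive attribute within each equivalence class:
# Number of distinct hourwage values (the sensitive attribute) per equivalence class
df_dummy %>%
group_by(Region, sex, race, age_group, education_group) %>%
summarise(l = n_distinct(hourwage), .groups = "drop")
The first class returns l = 1, confirming that it fails the l = 2 requirement.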
Example: Suppose we augment our dataset to include a
new sensitive attribute: Disease. This updated dataset
maintains \(k = 3\) anonymity, as for
every combination of quasi-identifiers, there are at least 3 records
that share the same set of values. Additionally, the dataset achieves
\(l = 2\) diversity because each
equivalence class contains at least two distinct values for both
sensitive attributes.
# Create a dummy dataframe with 9 rows and 3 equivalence classes, including the 'Disease' column
df_dummy <- data.frame(
Region = c(
"Northern Virginia", "Northern Virginia", "Northern Virginia",
"Central Virginia", "Central Virginia", "Central Virginia",
"Eastern Virginia", "Eastern Virginia", "Eastern Virginia"
),
sex = c(
"Female", "Female", "Female",
"Female", "Female", "Female",
"Male", "Male", "Male"
),
race = c(
"White", "White", "White",
"Asian", "Asian", "Asian",
"Black", "Black", "Black"
),
age_group = c(
"30-49", "30-49", "30-49",
"40-59", "40-59", "40-59",
"10-29", "10-29", "10-29"
),
education_group = c(
"Bachelor's Degree", "Bachelor's Degree", "Bachelor's Degree",
"High School", "High School", "High School",
"Master's Degree", "Master's Degree", "Master's Degree"
),
hourwage = c(
25.00, 26.00, 25.00,
16.75, 14.00, 17.20,
30.10, 27.10, 40.00
),
Disease = c(
"Diabetes", "Heart Disease", "Heart Disease",
"Gastritis", "Stomach Cancer", "Gastric Ulcer",
"Alopecia", "Diabetes", "Alopecia"
)
)
head(df_dummy, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| Region | sex | race | age_group | education_group | hourwage | Disease |
|---|---|---|---|---|---|---|
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 | Diabetes |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 26.00 | Heart Disease |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.00 | Heart Disease |
| Central Virginia | Female | Asian | 40-59 | High School | 16.75 | Gastritis |
| Central Virginia | Female | Asian | 40-59 | High School | 14.00 | Stomach Cancer |
| Central Virginia | Female | Asian | 40-59 | High School | 17.20 | Gastric Ulcer |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 30.10 | Alopecia |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 27.10 | Diabetes |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 40.00 | Alopecia |
One of the major challenges of ensuring data privacy, even after implementing l-diversity, is the risk of attribute disclosure. Let’s consider a scenario where an attacker has some auxiliary information about an individual in the dataset. For example, suppose the attacker knows that Emily is an Asian female with a high school education. Even if the database has achieved \(k=3\) anonymity and \(l=2\) diversity, the attacker can still make concerning inferences.
Despite not knowing Emily’s exact hourly wage, the attacker can confidently deduce that Emily earns less than 18 dollars per hour. This is because, within Emily’s equivalence class (Central Virginia, Asian, Female, age 40-59, High School), all wage values are below this threshold. Furthermore, the attacker can also infer that Emily has a health issue related to the stomach, as the only diseases represented within this class are “Gastritis,” “Stomach Cancer,” and “Gastric Ulcer.” This is what is known as a homogeneity attack.
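A minimal sketch of how such a homogeneity attack could be carried out against the dummy data above (column names as defined earlier; the attacker's background knowledge is hypothetical):
# Hypothetical attacker's view: restrict to Emily's equivalence class
attacker_view <- df_dummy %>%
filter(race == "Asian", sex == "Female", education_group == "High School")
range(attacker_view$hourwage) # every wage in the class is below 18
unique(attacker_view$Disease) # every disease in the class is stomach-related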
While k-anonymity and l-diversity aim to protect against identity and attribute disclosure, respectively, they may still leave datasets vulnerable to more sophisticated inference attacks. Even with l-diversity, if the distribution of sensitive attributes within an equivalence class is vastly different from the overall distribution, attackers with background knowledge could infer sensitive information. This vulnerability is addressed through t-closeness.
Definition: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole dataset does not exceed a specified threshold \(t\).
– A table/dataset is said to have t-closeness if all equivalence classes have t-closeness.
The Earth Mover’s Distance (EMD) measures the minimum “cost” to transform one distribution into another. This cost is calculated by summing over the absolute differences of the cumulative probabilities from each distribution up to each point \(i\), normalized by the total number of elements minus one (\(m - 1\)). By calculating the EMD between the distribution of the sensitive attribute in each equivalence class and the distribution in the full dataset, we can quantitatively measure how much the anonymized data deviates from the original distribution.
The provided formula for EMD is expressed as:
\[ E(P, Q) = \frac{1}{m - 1} \sum_{i=1}^m \left| \sum_{j=1}^i (p_j - q_j) \right| \]
where \(m\) is the number of elements in the distributions \(P\) and \(Q\), and \(p_j\) and \(q_j\) are the probabilities at position \(j\) in distributions \(P\) and \(Q\), respectively.
Choosing an appropriate threshold \(t\) for t-closeness is crucial to ensuring the effectiveness of this privacy-preserving method. The threshold \(t\) determines the maximum allowable distance between the distribution of a sensitive attribute within each equivalence class and its distribution across the entire dataset. A smaller \(t\) value indicates a stricter privacy constraint.
Unfortunately, “[a]t this point, no universal threshold for t appears to exist.” The authors mention that “the t-closeness principle can be applied using other distance measures. While EMD is the best measure we have found so far, it is certainly not perfect. In particular, the relationship between the value t and information gain is unclear.” Because of this, it is crucial to consider the following aspects:
1. Calculate Overall Distribution of Sensitive Attributes:
Now that we understand the purpose of t-closeness, the first step in applying it to our dataset is to determine the distribution of each sensitive attribute, in this case hourwage and disease, across the entire dataset. For the purposes of this example, we will focus only on the numeric hourwage attribute rather than the categorical disease attribute; a similar process can be applied to categorical data.
overall_distribution_hourwage <- df_dummy %>%
summarise(mean_hourwage = mean(hourwage, na.rm = TRUE),
sd_hourwage = sd(hourwage, na.rm = TRUE))
head(overall_distribution_hourwage, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| mean_hourwage | sd_hourwage |
|---|---|
| 24.57222 | 7.942572 |
overall_distribution_disease <- df_dummy %>%
count(Disease) %>%
mutate(proportion = n / sum(n))
head(overall_distribution_disease, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| Disease | n | proportion |
|---|---|---|
| Alopecia | 2 | 0.2222222 |
| Diabetes | 2 | 0.2222222 |
| Gastric Ulcer | 1 | 0.1111111 |
| Gastritis | 1 | 0.1111111 |
| Heart Disease | 2 | 0.2222222 |
| Stomach Cancer | 1 | 0.1111111 |
2. Calculate Distribution Within Each Equivalence Class:
For each group of records that share the same quasi-identifiers (e.g., Region, sex, race, age_group, education_group), compute the distribution of the sensitive attributes.
equivalence_classes <- df_dummy %>%
group_by(Region, sex, race, age_group, education_group) %>%
summarise(mean_hourwage = mean(hourwage, na.rm = TRUE),
sd_hourwage = sd(hourwage, na.rm = TRUE),
disease_distribution = list(table(Disease) / length(Disease)),
.groups = "drop")
head(equivalence_classes, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| Region | sex | race | age_group | education_group | mean_hourwage | sd_hourwage | disease_distribution |
|---|---|---|---|---|---|---|---|
| Central Virginia | Female | Asian | 40-59 | High School | 15.98333 | 1.7322914 | 0.3333333, 0.3333333, 0.3333333 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 32.40000 | 6.7505555 | 0.6666667, 0.3333333 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.33333 | 0.5773503 | 0.3333333, 0.6666667 |
3. Compare Distributions Using a Distance Metric:
Use a distance metric (like the Earth Mover’s Distance or Jensen-Shannon Divergence) to quantify how different the distributions within each equivalence class are from the overall distribution. For simplicity, let’s calculate the absolute difference in means:
equivalence_classes <- equivalence_classes %>%
mutate(hourwage_diff = abs(mean_hourwage - overall_distribution_hourwage$mean_hourwage))
head(equivalence_classes, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| Region | sex | race | age_group | education_group | mean_hourwage | sd_hourwage | disease_distribution | hourwage_diff |
|---|---|---|---|---|---|---|---|---|
| Central Virginia | Female | Asian | 40-59 | High School | 15.98333 | 1.7322914 | 0.3333333, 0.3333333, 0.3333333 | 8.5888889 |
| Eastern Virginia | Male | Black | 10-29 | Master’s Degree | 32.40000 | 6.7505555 | 0.6666667, 0.3333333 | 7.8277778 |
| Northern Virginia | Female | White | 30-49 | Bachelor’s Degree | 25.33333 | 0.5773503 | 0.3333333, 0.6666667 | 0.7611111 |
4. Adjust the Data to Meet t-Closeness:
If the difference exceeds a predefined threshold \(t\), adjust the data. This might involve:
Smoothing the values of the sensitive attribute.
For the Central Virginia equivalence class, we may increase the hourwage values slightly to bring the mean closer to 24.57. Given the low standard deviation, adjustments should be subtle to maintain the class’s internal consistency.
For the Eastern Virginia class, we may decrease the hourwage values to make the mean closer to 24.57, while considering the existing variability to avoid over-smoothing.
For the Northern Virginia class, minimal or no adjustment is needed since the hourwage mean is already close to the overall mean.
Grouping records further or making other data transformations to balance the distributions.
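As an illustration of the smoothing idea listed above, a sketch like the following could shrink each record's hourwage part of the way toward the overall mean; the 20% shrinkage factor is an arbitrary choice of ours.
alpha <- 0.2 # fraction of the gap to the overall mean to remove (assumed value)
overall_mean_hw <- overall_distribution_hourwage$mean_hourwage
# Hypothetical smoothing: move each hourwage value toward the overall mean,
# which pulls every class mean closer to the global mean
df_smoothed <- df_dummy %>%
mutate(hourwage_smoothed = hourwage + alpha * (overall_mean_hw - hourwage))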
overall_mean <- overall_distribution_hourwage$mean_hourwage
t <- 0.1 # 10% threshold
equivalence_classes <- equivalence_classes %>%
mutate(t_closeness = ifelse(hourwage_diff <= overall_mean * t, TRUE, FALSE))
# Install the transport package
#install.packages("transport")
# Load the package
library(transport)
## Warning: package 'transport' was built under R version 4.3.3
# Overall wage distribution
overall_wage <- df_dummy$hourwage
# Initialize an empty list to store results
results <- list()
# Create a unique identifier for each group based on quasi-identifiers
df_dummy$GroupID <- with(df_dummy, interaction(Region, sex, race, age_group, education_group))
# Unique groups
unique_groups <- unique(df_dummy$GroupID)
# Loop over each group to calculate EMD
for (group_id in unique_groups) {
subgroup <- df_dummy[df_dummy$GroupID == group_id, ]
subgroup_wage <- subgroup$hourwage
# Calculate Wasserstein distance
emd_value <- wasserstein1d(subgroup_wage, overall_wage)
# Store results
results[[as.character(group_id)]] <- emd_value
}
# Display results
results
## $`Northern Virginia.Female.White.30-49.Bachelor's Degree`
## [1] 5.25
##
## $`Central Virginia.Female.Asian.40-59.High School`
## [1] 8.588889
##
## $`Eastern Virginia.Male.Black.10-29.Master's Degree`
## [1] 7.827778
# Example Data
df_data <- data.frame(
ZIP_Code = c("4767*", "4767*", "4767*", "4790*", "4790*", "4790*", "4760*", "4760*", "4760*"),
Age = c("<= 40", "<= 40", "<= 40", ">= 40", ">= 40", ">= 40", "<= 40", "<= 40", "<= 40"),
Salary = c("3K", "5K", "9K", "6K", "11K", "8K", "4K", "7K", "10K"),
Disease = c("gastric ulcer", "stomach cancer", "pneumonia", "gastritis", "flu", "bronchitis", "gastritis", "bronchitis", "stomach cancer")
)
# Convert Salary to numeric
df_data$Salary <- as.numeric(sub("K", "", df_data$Salary)) * 1000
table(df_data$Salary) / nrow(df_data)
##
## 3000 4000 5000 6000 7000 8000 9000 10000
## 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111
## 11000
## 0.1111111
create_auto_distributions <- function(data, attribute) {
# Determine unique values in the specified attribute to form equivalence classes
unique_values <- unique(data[[attribute]])
# Initialize list to hold distributions
distributions <- list()
# Total number of elements
n <- nrow(data)
# Loop through each unique value to create distributions
for (value in unique_values) {
# Create a zero vector of length n
distribution <- numeric(n)
# Find rows where the attribute matches the current unique value
indices <- which(data[[attribute]] == value)
# Assign equal probabilities to these indices
distribution[indices] <- 1 / length(indices)
# Name the distribution based on the unique value
distributions[[paste(attribute, value, sep = "_")]] <- distribution
}
# Return the list of distributions
return(distributions)
}
create_auto_distributions(df_data, "Salary")
## $Salary_3000
## [1] 1 0 0 0 0 0 0 0 0
##
## $Salary_5000
## [1] 0 1 0 0 0 0 0 0 0
##
## $Salary_9000
## [1] 0 0 1 0 0 0 0 0 0
##
## $Salary_6000
## [1] 0 0 0 1 0 0 0 0 0
##
## $Salary_11000
## [1] 0 0 0 0 1 0 0 0 0
##
## $Salary_8000
## [1] 0 0 0 0 0 1 0 0 0
##
## $Salary_4000
## [1] 0 0 0 0 0 0 1 0 0
##
## $Salary_7000
## [1] 0 0 0 0 0 0 0 1 0
##
## $Salary_10000
## [1] 0 0 0 0 0 0 0 0 1
# Define the distributions for uniform and equivalence classes
Q3 <- rep(1/9, 9)
P31 <- c(1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0)
P32 <- c(0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3)
P33 <- c(0, 0, 0, 0, 1/3, 0, 1/3, 1/3, 0)
# Define the distributions
P31 <- c(1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0)
Q3 <- rep(1/9, 9)
# Calculate EMD using the defined formula
emd_function <- function(P, Q) {
m <- length(P) # number of positions in the distributions
abs_cumulative_diffs <- abs(cumsum(P - Q)) # absolute cumulative differences
emd <- sum(abs_cumulative_diffs) / (m - 1) # EMD calculation
return(emd)
}
# Calculate EMD for P31 vs. Q3
emd_P31 <- emd_function(P31, Q3)
emd_P32 <- emd_function(P32, Q3)
emd_P33 <- emd_function(P33, Q3)
print(sprintf("EMD P3,1 vs. Q3: %f", emd_P31))
## [1] "EMD P3,1 vs. Q3: 0.375000"
print(sprintf("EMD P3,2 vs. Q3: %f", emd_P32))
## [1] "EMD P3,2 vs. Q3: 0.166667"
print(sprintf("EMD P3,3 vs. Q3: %f", emd_P33))
## [1] "EMD P3,3 vs. Q3: 0.236111"
# Parameters
P31 <- c(1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0)
Q3 <- rep(1/9, 9)
# Compute differences
differences <- P31 - Q3
# Compute cumulative sums of differences to find how much mass to move
cum_diff <- cumsum(differences)
# The total work is the sum of absolute cumulative differences
# (except the last one, because it should theoretically be zero if all previous are correct)
total_work <- sum(abs(cum_diff[-length(cum_diff)]))
# Since we move masses and we need to consider the distance moved, we calculate the EMD
distances <- 1:(length(P31) - 1) # distances from each position to the next
emd_value <- sum(abs(cum_diff[-length(cum_diff)]) * distances)
# Normalize by the total mass moved
emd_value_normalized <- emd_value / sum(P31)
# Print the EMD result
cat("EMD for P3,1 vs. Q3:", emd_value_normalized, "\n")
## EMD for P3,1 vs. Q3: 12
Let’s now turn to the discrimination aspect of fairness. How can we measure that a historical dataset contains a bias against a sensitive group, and how can we change the dataset in a preprocessing step so as to remove this measured bias from the dataset?
Positive outcome: An outcome we want to balance across groups, such as earning at or above the overall average hourly wage.
Sensitive group (S): Group that might face discrimination (e.g., Black).
Non-sensitive group (NS): Group that may have an advantage (e.g., White).
For this example, let’s assume df2 contains the data with columns race and hourwage.
# Grab the race and hourwage columns from df2
df_race_outcome <- df2 %>%
select(race, hourwage)
Step 1: Create the outcome variable Y using the overall mean hourwage as the threshold
# Calculate the mean of hourwage (used as the threshold for a positive outcome)
mean_hourwage <- mean(df_race_outcome$hourwage, na.rm = TRUE)
# Create the outcome variable Y: 1 if hourwage >= mean_hourwage, 0 otherwise
df_race_outcome <- df_race_outcome %>%
mutate(Y = ifelse(hourwage >= mean_hourwage, 1, 0))
# View the resulting dataframe
head(df_race_outcome, 10) %>%
kable() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| race | hourwage | Y |
|---|---|---|
| White | 11.49 | 0 |
| White | 14.50 | 0 |
| White | 9.00 | 0 |
| White | 7.50 | 0 |
| White | 7.25 | 0 |
| White | 10.42 | 0 |
| White | 13.00 | 0 |
| White | 12.00 | 0 |
| White | 28.75 | 1 |
| White | 13.00 | 0 |
Step 2: Calculate the number of positive outcomes for non-sensitive (White) and sensitive (Black) groups
ns_positive <- df_race_outcome %>% filter(race == "White" & Y == 1) %>% nrow() # Count of positive outcomes for 'White'
s_positive <- df_race_outcome %>% filter(race == "Black" & Y == 1) %>% nrow() # Count of positive outcomes for 'Black'
# Print the results
cat("Number of positive outcomes for White:", ns_positive, "\n")
## Number of positive outcomes for White: 69361
cat("Number of positive outcomes for Black:", s_positive, "\n")
## Number of positive outcomes for Black: 7142
Step 3: Calculate the total number of instances for non-sensitive (White) and sensitive (Black) groups
ns_total <- df_race_outcome %>% filter(race == "White") %>% nrow() # Total number of records for 'White'
s_total <- df_race_outcome %>% filter(race == "Black") %>% nrow() # Total number of records for 'Black'
# Print the values
cat("Total instances for White group (ns_total):", ns_total, "\n")
## Total instances for White group (ns_total): 184338
cat("Total instances for Black group (s_total):", s_total, "\n")
## Total instances for Black group (s_total): 26113
Step 4: Calculate the probabilities
P_ns_positive <- ns_positive / ns_total # Probability of positive outcome for 'ns'
P_s_positive <- s_positive / s_total # Probability of positive outcome for 's'
# Print the probabilities
cat("Probability of positive outcome for White (P_ns_positive):", P_ns_positive, "\n")
## Probability of positive outcome for White (P_ns_positive): 0.3762708
cat("Probability of positive outcome for Black (P_s_positive):", P_s_positive, "\n")
## Probability of positive outcome for Black (P_s_positive): 0.2735036
Step 5: Calculate Statistical Parity and Disparate Impact
statistical_parity <- P_ns_positive - P_s_positive # Absolute difference in probabilities
disparate_impact <- P_ns_positive / P_s_positive # Ratio of probabilities
cat("Statistical Parity:", statistical_parity, "\n")
## Statistical Parity: 0.1027671
cat("Disparate Impact:", disparate_impact, "\n")
## Disparate Impact: 1.375743
Statistical Parity measures the absolute difference in the probability of a positive outcome between the non-sensitive (White) and sensitive (Black) groups. Here, a value of 0.1028 means that White individuals have a probability of receiving a positive outcome that is about 10.3 percentage points higher than that of Black individuals.
Disparate Impact measures the ratio of the probability of a positive outcome between the two groups. A value of 1.376 indicates that White individuals are about 1.38 times as likely to receive a positive outcome as Black individuals.
These results suggest that there might be a bias in the dataset favoring White individuals over Black individuals for positive outcomes. However, disparate impact on its own does not account for other factors that might explain the difference, so further examination of the data is needed to assess fairness more accurately.
Massaging
Massaging involves relabeling a specific number of instances to balance outcomes across groups.
To calculate the number of label changes \(M\) required to balance your dataset, you can use the formula:
\[ M = \frac{\text{disc}(D) \cdot s_T \cdot ns_T}{s_T + ns_T} \]
where \(\text{disc}(D)\) is the measured discrimination (the statistical parity computed above), \(s_T\) is the total number of records in the sensitive group, and \(ns_T\) is the total number of records in the non-sensitive group.
Let’s substitute our calculated values into this formula:
\[ M = \frac{0.1028 \times 26113 \times 184338}{26113 + 184338} \]
The result for \(M\) will provide the number of label changes needed to reduce observed bias in the dataset by switching outcomes for individuals in both the sensitive and non-sensitive groups.
M <- (statistical_parity * s_total * ns_total) / (s_total + ns_total)
# Print the result for M
cat("Number of label changes required (M):", M, "\n")
## Number of label changes required (M): 2350.579
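To actually apply massaging, roughly M negative labels in the sensitive group are promoted and roughly M positive labels in the non-sensitive group are demoted. The sketch below flips labels at random for simplicity; in the standard massaging procedure, a ranker or classifier score is used to decide which records to relabel.
set.seed(42)
M_int <- round(M) # number of labels to flip in each group
# Hypothetical massaging step: promote M Black records from 0 to 1 and
# demote M White records from 1 to 0
df_massaged <- df_race_outcome
promote <- sample(which(df_massaged$race == "Black" & df_massaged$Y == 0), M_int)
demote <- sample(which(df_massaged$race == "White" & df_massaged$Y == 1), M_int)
df_massaged$Y[promote] <- 1
df_massaged$Y[demote] <- 0
# After relabeling, the two groups' positive-outcome rates are approximately equal
with(df_massaged, tapply(Y, race, mean))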