This code through explores several ways to deidentify data in R.
Specifically, we’ll explain and demonstrate a manual method of deidentification and the use of different packages. The manual method uses a sequence function and the packages covered are ‘anonymizer’, ‘deidentifyr’, and ‘digest’.
Deidentification is the process of removing personally identifiable information (PII) from data. PII can include items such as banking information, Social Security numbers, and addresses. This topic is valuable because data deidentification protects the privacy of individuals and is important in many industries, including but not limited to health care, banking, pharmaceuticals, and education. Deidentification also supports compliance with ethical and legal standards of data collection and analysis.
Specifically, you’ll learn how to…
For the purposes of this demonstration we will be using a toy dataset with the basic health information of a fictional family. The goal will be to deidentify the information by removing each person’s name. Here is a view of the data:
| name | sex | age | height | weight |
|---|---|---|---|---|
| Homer | M | 36 | 6 | 250 |
| Marge | F | 36 | 5 | 120 |
| Bart | M | 10 | 4 | 75 |
| Lisa | F | 8 | 3 | 57 |
| Maggie | F | 1 | 2 | 20 |
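The code in this walkthrough assumes the table above is stored in a data frame called `data`. The following is a minimal sketch of how it could be built in base R. The construction is an assumption for illustration, since the original dataset may have been loaded from a file.

```r
# Toy dataset; values are copied from the table above.
# The name `data` matches the variable used in the rest of this code through.
data <- data.frame(
  name   = c("Homer", "Marge", "Bart", "Lisa", "Maggie"),
  sex    = c("M", "F", "M", "F", "F"),
  age    = c(36, 36, 10, 8, 1),
  height = c(6, 5, 4, 3, 2),
  weight = c(250, 120, 75, 57, 20),
  stringsAsFactors = FALSE  # keep names as character strings, not factors
)

# Inspect the structure: 5 rows, 5 columns
str(data)
```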
The simplest way to deidentify data is to remove the PII. This can be accomplished by selecting only the columns that do not contain PII. For example:
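library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling() and the %>% pipe
data <- data[ , c("sex", "age", "height", "weight")]  # keep only the non-PII columns
head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover",
                                      "condensed"))

| sex | age | height | weight |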
|---|---|---|---|
| M | 36 | 6 | 250 |
| F | 36 | 5 | 120 |
| M | 10 | 4 | 75 |
| F | 8 | 3 | 57 |
| F | 1 | 2 | 20 |
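When a dataset has many columns, it can be easier to list the columns to drop than the columns to keep. The sketch below shows one way to do that in base R. The data frame is rebuilt here so the example is self-contained, and `pii_cols` and `deidentified` are hypothetical names chosen for illustration.

```r
# Rebuild the toy dataset so this example runs on its own
data <- data.frame(
  name   = c("Homer", "Marge", "Bart", "Lisa", "Maggie"),
  sex    = c("M", "F", "M", "F", "F"),
  age    = c(36, 36, 10, 8, 1),
  height = c(6, 5, 4, 3, 2),
  weight = c(250, 120, 75, 57, 20),
  stringsAsFactors = FALSE
)

# Hypothetical list of columns considered PII in this dataset
pii_cols <- c("name")

# Keep every column whose name is not in pii_cols;
# drop = FALSE guarantees the result is still a data frame
deidentified <- data[, setdiff(names(data), pii_cols), drop = FALSE]

names(deidentified)
```

Listing the PII columns explicitly also documents which fields were treated as identifying.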
However, this technique loses the ability to uniquely identify each individual and does not allow for future merging. To address this, a column can be created that labels each individual with their row number.
data$ID <- seq.int(nrow(data))  # Create an ID from the row number
col_order <- c("ID", "sex", "age",
               "height", "weight")  # Put the ID column first
data <- data[, col_order]
head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover",
                                      "condensed"))

| ID | sex | age | height | weight |
|---|---|---|---|---|
| 1 | M | 36 | 6 | 250 |
| 2 | F | 36 | 5 | 120 |
| 3 | M | 10 | 4 | 75 |
| 4 | F | 8 | 3 | 57 |
| 5 | F | 1 | 2 | 20 |
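An ID column only supports future merging if the link between IDs and names is kept somewhere. One possible approach, sketched below under that assumption, is to create the ID before dropping the name and to save a separate lookup table. The names `key`, `deidentified`, and `reidentified` are hypothetical.

```r
# Rebuild the toy dataset so this example runs on its own
data <- data.frame(
  name   = c("Homer", "Marge", "Bart", "Lisa", "Maggie"),
  sex    = c("M", "F", "M", "F", "F"),
  age    = c(36, 36, 10, 8, 1),
  height = c(6, 5, 4, 3, 2),
  weight = c(250, 120, 75, 57, 20),
  stringsAsFactors = FALSE
)

# Create the ID while the name column is still present
data$ID <- seq_len(nrow(data))

# Lookup key linking each ID to a name; this table contains PII
# and would need to be stored separately from the shared data
key <- data[, c("ID", "name")]

# Deidentified version of the dataset, safe to share without the key
deidentified <- data[, c("ID", "sex", "age", "height", "weight")]

# Anyone holding the key can restore the names by merging on ID
reidentified <- merge(key, deidentified, by = "ID")
head(reidentified)
```

Without the key, the deidentified table cannot be linked back to names through the ID alone.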
The ‘anonymizer’ package uses a combination of salting and hashing to replace PII with a unique identifier (Hendricks, 2015). The package can be installed from CRAN or from GitHub depending on your version of R. The package can be used as follows:
library(anonymizer)
# Assumes `data` is the original dataset, with the name column still present
data$name <- anonymize(data$name, .algo = "crc32")
head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover",
                                      "condensed"))

| name | sex | age | height | weight |
|---|---|---|---|---|
| 84f5f606 | M | 36 | 6 | 250 |
| 86c9e65d | F | 36 | 5 | 120 |
| 9af00579 | M | 10 | 4 | 75 |
| 74bb5dc2 | F | 8 | 3 | 57 |
| 43a7b0c2 | F | 1 | 2 | 20 |
*Note: The ‘.algo’ argument selects the hash algorithm. ‘crc32’ was used here because it produces a short hash, but other algorithms can be used depending on your needs.
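The trade-off between algorithms mentioned in the note can be illustrated with the ‘digest’ package directly. This sketch does not call ‘anonymizer’; it only compares the length of the identifiers produced by two algorithms and checks that the identifiers are unique. `names_vec` and `hash_names()` are hypothetical names introduced for the example.

```r
library(digest)

# Names from the toy dataset
names_vec <- c("Homer", "Marge", "Bart", "Lisa", "Maggie")

# Hypothetical helper: hash each element of x with the chosen algorithm.
# USE.NAMES = FALSE stops vapply() from labelling each hash with its input.
hash_names <- function(x, algo) {
  vapply(x, digest, character(1), algo = algo, USE.NAMES = FALSE)
}

md5_ids    <- hash_names(names_vec, "md5")
sha256_ids <- hash_names(names_vec, "sha256")

# Longer hashes give longer identifiers
nchar(md5_ids)     # 32 hexadecimal characters each
nchar(sha256_ids)  # 64 hexadecimal characters each

# anyDuplicated() returns 0 when every identifier is unique
anyDuplicated(sha256_ids)
```

Shorter hashes are easier to read in a table, while longer ones leave less room for two different inputs to share an identifier.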
Another package that can be used for data deidentification is ‘deidentifyr’. It generates a unique ID code from a slightly longer SHA-256 hash, with the aim of preventing recovery of the hashed PII (Wilcox, 2019). The package is not yet on CRAN but can be installed from GitHub. It can be used as follows:
library(deidentifyr)
# Assumes `data` is the original dataset, with the name column still present
data$name <- deidentify(data, name, sex, age, height, weight)
head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover",
                                      "condensed"))

| name | sex | age | height | weight |
|---|---|---|---|---|
| aabca79f52 | M | 36 | 6 | 250 |
| aed7926143 | F | 36 | 5 | 120 |
| 9bed3d38cb | M | 10 | 4 | 75 |
| fe715ca98a | F | 8 | 3 | 57 |
| e2c0a02e62 | F | 1 | 2 | 20 |
*Note: This package is experimental and still under development.
A third package that can be used is ‘digest’. This package generates a hashed character string, and a variety of algorithms can be used depending on your needs (Eddelbuettel et al., 2020). The package can be used as follows:
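library(digest)
# Assumes `data` is the original dataset, with the name column still present.
# USE.NAMES = FALSE matters here: sapply() would otherwise attach the
# original names to the hashes as element names.
data$name <- vapply(data$name, digest, character(1),
                    algo = "crc32", USE.NAMES = FALSE)
head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover",
                                      "condensed"))

| name | sex | age | height | weight |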
|---|---|---|---|---|
| b03edabf | M | 36 | 6 | 250 |
| 3e9e21f4 | F | 36 | 5 | 120 |
| 1e81279c | M | 10 | 4 | 75 |
| 848bd4bd | F | 8 | 3 | 57 |
| afae2042 | F | 1 | 2 | 20 |
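Because names come from a fairly small set of possible values, a plain hash can in principle be reversed by hashing candidate names and comparing the results. A common mitigation is to add a secret salt before hashing. The following is a minimal sketch of that idea using ‘digest’; the salt value and the `salted_hash()` helper are hypothetical and are not part of any of the packages discussed here.

```r
library(digest)

# Names from the toy dataset
names_vec <- c("Homer", "Marge", "Bart", "Lisa", "Maggie")

# Hypothetical secret value; in practice it would be kept out of the code
salt <- "replace-with-a-secret-value"

# Hypothetical helper: prepend the salt to each value, then hash the string.
# serialize = FALSE hashes the text itself rather than the serialized R object.
salted_hash <- function(x, salt) {
  vapply(paste0(salt, x), digest, character(1),
         algo = "sha256", serialize = FALSE, USE.NAMES = FALSE)
}

ids <- salted_hash(names_vec, salt)

# The same input and salt always produce the same identifiers,
# so records can still be matched across datasets
identical(ids, salted_hash(names_vec, salt))

# A different salt produces different identifiers for the same names
any(ids == salted_hash(names_vec, "another-salt"))
```

Anyone who does not know the salt cannot reproduce the identifiers simply by hashing a list of likely names.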
While these deidentification methods are useful in protecting the personal information of individuals, they should be combined with other best practices in order to comply with legal and ethical standards and to create multiple levels of protection.
Learn more about PII, deidentification, and other packages with the following resources:
Article on Personally Identifiable Information (PII): https://www.investopedia.com/terms/p/personally-identifiable-information-pii.asp
Video on Deidentification: https://www.youtube.com/watch?v=ULLR-UkG7_A
List of R data deidentification packages: https://osf.io/k96ah/?pid=eh9bj
This code through references and cites the following sources:
Hendricks (2015). “anonymizer: Anonymize Data Containing Personally Identifiable Information.” https://github.com/paulhendricks/anonymizer
Wilcox (2019). “deidentify: Deidentify a dataset.” https://rdrr.io/github/wilkox/deidentifyr/man/deidentify.html
Eddelbuettel et al. (2020). “Package ‘digest’.” https://cran.r-project.org/web/packages/digest/digest.pdf