Introduction

This code through explores several ways to deidentify data in R.

Content Overview

Specifically, we’ll explain and demonstrate a manual method of deidentification and the use of different packages. The manual method uses a sequence function and the packages covered are ‘anonymizer’, ‘deidentifyr’, and ‘digest’.

Why You Should Care

Deidentification is the process of removing personally identifiable information (PII) from data. PII can include items such as banking information, Social Security numbers, and addresses. This topic is valuable because data deidentification protects the privacy of individuals and is important in many industries, including but not limited to health care, banking, pharmaceuticals, and education. Deidentification helps ensure compliance with ethical and legal standards of data collection and analysis.

Learning Objectives

Specifically, you’ll learn how to…

Manually remove PII using base R
Create a unique identifier using package ‘anonymizer’
Create a unique identifier using package ‘deidentifyr’
Create a unique identifier using package ‘digest’

Dataset

For the purposes of this demonstration we will be using a toy dataset with the basic health information of a fictional family. The goal will be to deidentify the information by removing each person’s name. Here is a view of the data:

head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", 
                                      "hover",
                                      "condensed"))

name	sex	age	height	weight
Homer	M	36	6	250
Marge	F	36	5	120
Bart	M	10	4	75
Lisa	F	8	3	57
Maggie	F	1	2	20
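The code that builds the dataset is not shown, so here is a minimal sketch of how the table above could be recreated in base R. The values are copied from the printed output. The column types are an assumption, and the units for height and weight are not stated in the source, so they are left unlabeled.

```r
# Rebuild the toy dataset from the printed table (types are assumed)
data <- data.frame(
  name   = c("Homer", "Marge", "Bart", "Lisa", "Maggie"),
  sex    = c("M", "F", "M", "F", "F"),
  age    = c(36, 36, 10, 8, 1),
  height = c(6, 5, 4, 3, 2),
  weight = c(250, 120, 75, 57, 20),
  stringsAsFactors = FALSE   # keep names as character strings, not factors
)

str(data)   # check the structure before deidentifying
```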

Manual Method

The simplest way to deidentify data is to remove the PII. This can be accomplished by selecting only the columns without PII. For example:

data <- data[ , c("sex", "age", "height", "weight")]

head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", 
                                      "hover",
                                      "condensed"))

sex	age	height	weight
M	36	6	250
F	36	5	120
M	10	4	75
F	8	3	57
F	1	2	20

However, this technique loses the ability to uniquely identify each individual and doesn’t allow for future merging with other data. To address this, a column can be created that identifies each individual by their row number.

data$ID <- seq_len(nrow(data))        # Create an ID from the row position

col_order <- c("ID", "sex", "age",
               "height", "weight")    # Put the ID column first
data <- data[, col_order]

head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", 
                                      "hover",
                                      "condensed"))

ID	sex	age	height	weight
1	M	36	6	250
2	F	36	5	120
3	M	10	4	75
4	F	8	3	57
5	F	1	2	20
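If the data may need to be linked back to individuals later, one option is to keep a separate key table that maps each ID to the removed name. The following is a sketch in base R, not part of the original walkthrough. The object names `original`, `key`, `deidentified`, and `relinked` are hypothetical, and only a subset of the columns is used to keep it short.

```r
# Hypothetical starting point: data with the name column still present
original <- data.frame(
  name = c("Homer", "Marge", "Bart", "Lisa", "Maggie"),
  sex  = c("M", "F", "M", "F", "F"),
  age  = c(36, 36, 10, 8, 1),
  stringsAsFactors = FALSE
)

original$ID <- seq_len(nrow(original))    # Row-based ID, as above

# Key table linking ID to name; store it apart from the shared data
key <- original[, c("ID", "name")]

# Deidentified data that can be shared
deidentified <- original[, c("ID", "sex", "age")]

# Later, someone holding the key can re-link the records by ID
relinked <- merge(key, deidentified, by = "ID")
```

Sequential IDs preserve the original row order, so if the order itself is identifying (for example, rows sorted alphabetically by name), consider shuffling the rows before assigning IDs.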

Package ‘anonymizer’

Package ‘anonymizer’ uses a combination of salting and hashing to replace PII with a unique identifier (Hendricks, 2015). The package can be installed from CRAN or from GitHub depending on your version of R. The package can be used as follows:

data$name <- anonymize(data$name, .algo = "crc32")

head(data) %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", 
                                      "hover",
                                      "condensed"))

name	sex	age	height	weight
84f5f606	M	36	6	250
86c9e65d	F	36	5	120
9af00579	M	10	4	75
74bb5dc2	F	8	3	57
43a7b0c2	F	1	2	20

*Note: The argument ‘.algo’ selects the hash algorithm. The ‘crc32’ algorithm was used here because its output is short, but longer hashes such as ‘sha256’ can be used depending on your needs.

Package ‘deidentifyr’

Another package that can be used for data deidentification is ‘deidentifyr’. It generates a unique ID code from a SHA-256 hash, which is longer than the ‘crc32’ hash used above, and aims to avoid the potential recovery of hashed PII (Wilcox, 2019). This package is not yet on CRAN, but can be installed from GitHub. The package can be used as follows:

data$name <- deidentify(data, name, sex, age, height, weight)

head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", 
                                      "hover",
                                      "condensed"))

name	sex	age	height	weight
aabca79f52	M	36	6	250
aed7926143	F	36	5	120
9bed3d38cb	M	10	4	75
fe715ca98a	F	8	3	57
e2c0a02e62	F	1	2	20

*Note: This package is experimental and still under development.

Package ‘digest’

A third package that can be used is ‘digest’. This package generates a hashed character string, and a variety of algorithms can be used depending on your needs (Eddelbuettel et al., 2020). The package can be used as follows:

data$name <- sapply(data$name, digest, algo = "crc32")

head(data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", 
                                      "hover",
                                      "condensed"))

name	sex	age	height	weight
b03edabf	M	36	6	250
3e9e21f4	F	36	5	120
1e81279c	M	10	4	75
848bd4bd	F	8	3	57
afae2042	F	1	2	20
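Because a hash is deterministic, an unsalted hash of a guessable value such as a first name can be matched by hashing candidate names, which is the recovery risk mentioned in the ‘deidentifyr’ section. The sketch below illustrates this with ‘digest’ and shows how prepending a secret salt prevents the simple lookup. The helper `hash_names`, the salt string, and the list of guesses are hypothetical. It uses ‘sha256’ with `serialize = FALSE` so that the string itself is hashed rather than its serialized R representation.

```r
library(digest)

# Hypothetical helper: hash each string, optionally with a salt prepended
hash_names <- function(x, salt = "") {
  vapply(paste0(salt, x), digest, character(1),
         algo = "sha256", serialize = FALSE, USE.NAMES = FALSE)
}

family <- c("Homer", "Marge", "Bart", "Lisa", "Maggie")

# Unsalted hashes: anyone can hash candidate names and look for matches
unsalted  <- hash_names(family)
guesses   <- c("Ned", "Lisa", "Moe")
recovered <- guesses[hash_names(guesses) %in% unsalted]

# Salted hashes: the same lookup fails without knowledge of the salt
salted      <- hash_names(family, salt = "a-secret-kept-outside-the-data")
still_found <- guesses[hash_names(guesses) %in% salted]

recovered     # the guess that matched an unsalted hash
still_found   # empty: no guess matches the salted hashes
```

The salt only helps if it is kept secret and stored separately from the shared data. Reusing the same salt across datasets keeps the IDs consistent, so records can still be merged.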

Considerations

While these deidentification methods are useful in protecting the personal information of individuals, they should be combined with other best practices in order to comply with legal and ethical standards and to create multiple levels of protection.

Further Resources

Learn more about PII, deidentification, and other packages with the following resources:

Article on Personally Identifiable Information (PII): https://www.investopedia.com/terms/p/personally-identifiable-information-pii.asp
Video on Deidentification: https://www.youtube.com/watch?v=ULLR-UkG7_A
List of R data deidentification packages: https://osf.io/k96ah/?pid=eh9bj

Works Cited

This code through references and cites the following sources:

Paul Hendricks (2015). “anonymizer: Anonymize Data Containing Personally Identifiable Information.” https://github.com/paulhendricks/anonymizer
Wilcox (2019). “deidentify: Deidentify a dataset.” https://rdrr.io/github/wilkox/deidentifyr/man/deidentify.html
Eddelbuettel et al. (2020). “Package ‘digest’.” https://cran.r-project.org/web/packages/digest/digest.pdf

Deidentifying Data

Cassidy Kantoris

24 July 2020