Anonymising your data for export or vignette use

Here we will learn a method of anonymising your data in a way that can be reversed later, by replacing identifying values with an arbitrary reference ID and keeping a reference table that maps each ID back to the original value.

We will need a few libraries and a list of names:

library("tidyverse")  # attaches dplyr along with the other core tidyverse packages

## -- Attaching packages ---------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0

## -- Conflicts ------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library("babynames")

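# Copy the packaged dataset into the workspace
babynames <- babynames

# Keep only the name column as our pool of squirrel names
snames <- babynames %>%
  select("name")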

Arbitrary replacement and a reference table

Create a dataset

Imagine we are stuck on a tricky issue while manipulating our data, and other people might be able to solve it if they could see it. However, your data is proprietary, so you can’t share it as is. You could recreate a similar table with arbitrary data, but it might not reproduce the issue correctly. Alternatively, we can anonymise our existing data so that it can be shared. In that case we will likely need a way of recovering our original data once the solution has been applied.

Let’s first create some secret data:

secretsquirrelbusiness <- data.frame(
  "sqnames" = sample_n(snames, 15),
  "nuts collected" = sample(1:20, 15, replace = TRUE),
  "hours worked" = runif(15, 4, 12),
  "family size" = sample(1:10, 15, replace = TRUE)
)

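Here data.frame builds a new data frame from the vectors we define. Our first argument uses sample_n to randomly select 15 rows from our name list “snames”; because sample_n returns a data frame, the resulting column keeps its original name, “name”, rather than “sqnames”. After that we use sample to draw integers from a range, and runif to return random numbers between 4 and 12. The spaces in the remaining column names are converted to dots, giving nuts.collected, hours.worked and family.size.

We can see the top of our top-secret table with: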

head(secretsquirrelbusiness)

##        name nuts.collected hours.worked family.size
## 1 Kristeena             12     7.340366           1
## 2     Jerre             20     4.602814           9
## 3    Boston              4     7.642532           1
## 4     Angel              9     8.451196           6
## 5   Makalia              9     4.562885           6
## 6  Shakirah             20     6.126706           4

Now, say we need help determining the mean number of nuts collected by our squirrels. It’s a fair question, but Kristeena probably doesn’t want us telling the world his name or how many hours he works.

Our first challenge when anonymising data is deciding what we need to obscure and what information is necessary. The mean of nuts collected might be easy, since we can just send the nuts.collected column on its own, but if we wanted to graph multiple variables it will likely be easier to send the whole table with the identifying data obscured. However, if we obscure too much data, the table will no longer be useful.

For our example we will be obscuring our squirrels’ names.

Create a reference table

First, let’s create our reference ID table:

reftable <- data.frame(
  "sqname" = secretsquirrelbusiness$name,
  "ID" = paste("squirrel", sample(1:100, 15, replace = FALSE), sep = " "),
  stringsAsFactors = FALSE
)

Again we are using data.frame. Our first vector is the first column of the frame secretsquirrelbusiness, and our second is built with paste, which combines the string “squirrel” with a random number from our sample function. We have chosen the replace = FALSE argument because it avoids duplicate IDs. data.frame is called with stringsAsFactors = FALSE to ensure our output strings are stored as characters instead of factors, to match our initial table.
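Because the whole method relies on each ID pointing at exactly one squirrel, it can be worth checking that the IDs really are unique before sharing anything. The following is a minimal sketch, assuming a small stand-in table; reftable_demo, its three names and the seed are hypothetical and only there to make the example self-contained.

```r
# Hypothetical stand-in for the reference table built above
set.seed(1)  # assumption: seeded only so the sketch is repeatable
reftable_demo <- data.frame(
  sqname = c("Hazel", "Acorn", "Maple"),
  ID = paste("squirrel", sample(1:100, 3, replace = FALSE), sep = " "),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every value is unique
stopifnot(anyDuplicated(reftable_demo$ID) == 0)

# stringsAsFactors = FALSE keeps the IDs as plain character strings
stopifnot(is.character(reftable_demo$ID))
```

If either check fails, stopifnot raises an error, so a problem with the reference table is caught before the anonymised table leaves your hands.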

Replacing data

Now we can join the tables:

 anonymoustable <- inner_join(reftable, secretsquirrelbusiness, by = c("sqname" = "name"))

inner_join is part of the dplyr package. This function merges two tables on a common variable; here by = tells the function that the character columns “sqname” and “name” hold the same values.

And remove our squirrel friends’ names:

anonymoustable <- anonymoustable[-1]

This overwrites the anonymous table with the whole table except its first column, selected with [-1].

And voilà, we have our anonymised table:
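Dropping a column by position works here, but it silently removes the wrong column if the column order ever changes. As a hedged alternative, the sketch below drops the identifying column by name using dplyr's select; joined_demo and anonymous_demo are hypothetical stand-ins with the same shape as our joined table.

```r
library(dplyr)

# Hypothetical joined table, shaped like anonymoustable before the names are dropped
joined_demo <- data.frame(
  sqname = c("Hazel", "Acorn"),
  ID = c("squirrel 7", "squirrel 42"),
  nuts.collected = c(12, 4),
  stringsAsFactors = FALSE
)

# Drop the identifying column by name instead of by position
anonymous_demo <- select(joined_demo, -sqname)

names(anonymous_demo)
```

Only the ID and nuts.collected columns remain, whatever position sqname held in the joined table.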

print(anonymoustable)

##             ID nuts.collected hours.worked family.size
## 1  squirrel 50             12     7.340366           1
## 2  squirrel 36             20     4.602814           9
## 3  squirrel 94              4     7.642532           1
## 4  squirrel 40              9     8.451196           6
## 5  squirrel 24              9     4.562885           6
## 6  squirrel 43             20     6.126706           4
## 7  squirrel 87              5     8.423120           1
## 8  squirrel 56              9     9.938206           7
## 9  squirrel 97              6     4.474046           2
## 10 squirrel 59              4     5.648565           3
## 11 squirrel 96             16     7.180823           9
## 12 squirrel 83             10    11.594936           1
## 13 squirrel 58             15     8.855614           6
## 14 squirrel 47             16     6.395909           5
## 15 squirrel 22              7    11.344473           8

Retrieving references

Later, once we get our data back, if we need to check who is who, we can always re-join using our reference table:

checktable <- inner_join(reftable, anonymoustable)

## Joining, by = "ID"

print(checktable)

##       sqname          ID nuts.collected hours.worked family.size
## 1  Kristeena squirrel 50             12     7.340366           1
## 2      Jerre squirrel 36             20     4.602814           9
## 3     Boston squirrel 94              4     7.642532           1
## 4      Angel squirrel 40              9     8.451196           6
## 5    Makalia squirrel 24              9     4.562885           6
## 6   Shakirah squirrel 43             20     6.126706           4
## 7       Avry squirrel 87              5     8.423120           1
## 8    Novalie squirrel 56              9     9.938206           7
## 9      Meeka squirrel 97              6     4.474046           2
## 10    Sylena squirrel 59              4     5.648565           3
## 11  Patrecia squirrel 96             16     7.180823           9
## 12   Gayleen squirrel 83             10    11.594936           1
## 13 Druscilla squirrel 58             15     8.855614           6
## 14   Maylene squirrel 47             16     6.395909           5
## 15   Natoria squirrel 22              7    11.344473           8

Again we use inner_join to merge our tables on matched variables. Here, since the column name ID is the same in both tables, we do not need to specify it: inner_join finds the common variable automatically and reports it in the “Joining, by” message.
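If you would rather not rely on the automatic match, the join key can be named explicitly, which also silences the message. A minimal sketch, assuming two small hypothetical tables (ref_demo and anon_demo) in place of reftable and anonymoustable:

```r
library(dplyr)

# Hypothetical reference table and returned anonymised table
ref_demo <- data.frame(
  sqname = c("Hazel", "Acorn"),
  ID = c("squirrel 7", "squirrel 42"),
  stringsAsFactors = FALSE
)
anon_demo <- data.frame(
  ID = c("squirrel 42", "squirrel 7"),
  nuts.collected = c(4, 12),
  stringsAsFactors = FALSE
)

# Name the key explicitly; rows are matched on ID regardless of their order
check_demo <- inner_join(ref_demo, anon_demo, by = "ID")

print(check_demo)
```

Naming the key also protects against accidental matches if the returned table has gained another column that happens to share a name with the reference table.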

There we go: a method of anonymising our data table in a way that can later be restored to its original state. The same approach can be used for multiple variables by using multiple reference tables.
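To illustrate the multiple-variable case, here is one possible sketch that obscures two columns, each with its own reference table. Everything in it is an assumption made for illustration: the helper make_reftable, the tree column and the demo data are hypothetical and not part of the example above.

```r
library(dplyr)

# Hypothetical helper: build a reference table for the unique values of one column
make_reftable <- function(values, prefix) {
  originals <- unique(values)
  data.frame(
    original = originals,
    # Shuffle 1..n so each original value gets a distinct arbitrary number
    ID = paste(prefix, sample(seq_along(originals)), sep = " "),
    stringsAsFactors = FALSE
  )
}

# Hypothetical secret data with two identifying columns
secret_demo <- data.frame(
  name = c("Hazel", "Acorn", "Maple"),
  tree = c("Oak", "Oak", "Pine"),
  nuts.collected = c(12, 4, 9),
  stringsAsFactors = FALSE
)

name_ref <- make_reftable(secret_demo$name, "squirrel")
tree_ref <- make_reftable(secret_demo$tree, "tree")

# Join each reference table in turn, then keep only the IDs and the measurements
anonymous_demo <- secret_demo %>%
  inner_join(name_ref, by = c("name" = "original")) %>%
  rename(squirrel_ID = ID) %>%
  inner_join(tree_ref, by = c("tree" = "original")) %>%
  rename(tree_ID = ID) %>%
  select(squirrel_ID, tree_ID, nuts.collected)

print(anonymous_demo)
```

Keeping one reference table per variable means repeated values, such as two squirrels sharing a tree, map to the same ID, so relationships in the data survive the anonymisation.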

Anonymising your data

Joshua McCarthy

31 March 2019