Ken Ikeda
Open access to original clinical study reports is important for the development of health science. The challenge, then, is to efficiently detect personal information in order to preserve the anonymity of study participants. Such information includes, but is not limited to, person names and addresses. Data such as email addresses and phone numbers can be detected easily from their string patterns using base R functions and R packages such as stringi and stringr. The harder cases are free-form entities, so I have been looking for methods to detect ADDRESS, NAME and ORGANIZATION in text.
This is a trial to detect ADDRESS and NAME in medical reports using the cleanNLP package with the Stanford coreNLP engine. Stanford coreNLP appeared to perform better at named entity recognition than openNLP with “openNLPmodels.en”, but this remains to be studied. Of note, when I ran the same scripts on a Windows 8 machine, I encountered an encoding error (use ASCII/UTF-8!) at the annotation step, cnlp_annotate(). This did not happen in R 3.4.3 on macOS High Sierra.
The following clinical study report was used: http://www.maps.org/research-archive/mdma/MP-2_CSR_FINAL_15Sep11.pdf
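
The pattern-based detection of email addresses and phone numbers mentioned above can be sketched with base R alone. This is only an illustration: the sample text is made up, and the two patterns are deliberately simple assumptions that will miss many real-world formats.

```r
# Made-up sample text (not taken from the clinical study report).
txt <- c("Contact: jane.doe@example.org or +1 (831) 555-0100.",
         "No personal data on this page.")

# Simplified patterns; assumptions for illustration, not complete validators.
email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
phone_pattern <- "\\+?[0-9][0-9 ()-]{7,}[0-9]"

# Return every match of `pattern` in each element of `x` (a list, one entry per element).
extract_all <- function(x, pattern) regmatches(x, gregexpr(pattern, x))

emails <- extract_all(txt, email_pattern)
phones <- extract_all(txt, phone_pattern)
```

Each result is a list with one entry per input string, so an element without a match yields an empty character vector rather than an error.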

Install the cleanNLP package and the coreNLP model

Run the following commands:

library(markdown)
library(dplyr)
library(pdftools)
library(cleanNLP)
library(stringr)
library(knitr)
# One-time setup (uncomment on first use):
# install.packages("cleanNLP", dependencies = TRUE)
# cnlp_download_corenlp()  # downloads the coreNLP model; this took a few minutes

Read a sample pdf report and annotate it

The object csr is a character vector with one element per page (62 pages).
You must run the cnlp_init_corenlp() function before analyzing new text data.
Setting anno_level = 2 enables named-entity recognition; setting it to 3 or higher returns additional annotations.

#Read a sample Clinical Study Report
csr <- pdf_text("http://www.maps.org/research-archive/mdma/MP-2_CSR_FINAL_15Sep11.pdf")
#cnlp_init_corenlp() must be run in every new session; the first call takes a few seconds, later calls are faster
library(rJava)
cnlp_init_corenlp(language = "en", anno_level = 2, mem = "6g")
#Annotate the Clinical Study Report
annotated <- cnlp_annotate(csr, as_strings = TRUE, backend = "coreNLP")
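
Regarding the encoding error I saw on Windows: one thing that might be worth trying is to normalize the page text to valid UTF-8 before passing it to cnlp_annotate(). The sketch below uses only base R on a made-up two-page vector standing in for the pdf_text() output. I have not verified that this resolves the Windows error; it is an assumption, not a confirmed fix.

```r
# Made-up stand-in for the pdf_text() output: one string per page.
pages <- c("Page one: caf\u00e9", "Page two: plain ASCII")

# Convert to UTF-8, then drop any byte sequences that are not valid UTF-8.
pages_utf8 <- iconv(enc2utf8(pages), from = "UTF-8", to = "UTF-8", sub = "")

# Check that every page is now valid UTF-8 before annotating.
all_valid <- all(validUTF8(pages_utf8))
```

If all_valid is TRUE, the cleaned vector could be passed to the annotation step in place of the raw text.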

Locate data of interest by entity type

The available entity types are:
“LOCATION”, “PERSON”, “ORGANIZATION”, “MONEY”, “PERCENT”, “DATE”, and “TIME”.

annotated$entity %>% 
    filter(entity_type %in% c("PERSON", "LOCATION")) %>%
    select(id, entity_type, entity) %>% 
    mutate(page=as.integer(str_extract(id, "\\d+"))) %>% 
    select(page, entity_type, entity) %>% 
    arrange(page) %>% kable(format="markdown")
| page | entity_type | entity |
|-----:|:------------|:-------|
| 1 | LOCATION | 309 Cedar Street |
| 1 | LOCATION | # 2323 Santa Cruz |
| 1 | PERSON | Rick Doblin |
| 1 | PERSON | Peter Oehen |
| 2 | PERSON | Peter Oehen |
| 2 | PERSON | Verena Widmer |
| 2 | PERSON | Rafael Traber |
| 2 | LOCATION | Switzerland |
| 4 | PERSON | Oehen |
| 10 | LOCATION | U.S. |
| 13 | LOCATION | Canton |
| 13 | LOCATION | Aargau |
| 13 | LOCATION | Solothurn |
| 13 | LOCATION | Switzerland |
| 13 | LOCATION | Helsinki |
| 13 | PERSON | Peter Oehen |
| 13 | PERSON | Oehen |
| 13 | PERSON | Verena Widmer |
| 13 | PERSON | Oehen |
| 13 | LOCATION | Switzerland |
| 13 | PERSON | Oehen |
| 13 | PERSON | Rafael Traber |
| 13 | PERSON | Peter Oehen |
| 13 | PERSON | Verena Widmer |
| 13 | PERSON | Rafael Traber |
| 13 | PERSON | Rudolf Brenneisen |
| 13 | PERSON | Michael Mithoefer |
| 13 | LOCATION | Ilsa Jerome |
| 13 | PERSON | Christoph Kopp |
| 13 | LOCATION | U.S. |
| 13 | LOCATION | Europe |
| 13 | LOCATION | Spain |
| 14 | LOCATION | U.S |
| 14 | PERSON | Stanislav Grof |
| 14 | PERSON | Holotropic Breathwork |
| 17 | PERSON | Oehen |
| 19 | LOCATION | Switzerland |
| 19 | PERSON | Franz Vollenweider |
| 19 | LOCATION | Switzerland |
| 19 | PERSON | Rudolf Brenneisen |
| 19 | LOCATION | Switzerland |
| 19 | PERSON | Brenneisen |
| 19 | PERSON | Laboratory Bichsel |
| 19 | LOCATION | Interlaken |
| 19 | LOCATION | Switzerland |
| 20 | PERSON | M. Collenberg |
| 20 | LOCATION | U.S. |
| 20 | PERSON | Charles Grob |
| 28 | PERSON | Wilcoxon |
| 34 | PERSON | Mauchly |
| 36 | PERSON | Mauchly |
| 37 | PERSON | Mauchly |
| 37 | PERSON | Mauchly |
| 37 | PERSON | Mauchly |
| 38 | LOCATION | Switzerland |
| 38 | PERSON | Bonferroni |
| 44 | PERSON | Low |
| 53 | PERSON | Morbus Meulengracht |
| 60 | PERSON | Weathers |
| 60 | PERSON | T.M. Keane |
| 60 | PERSON | J.R. Davidson |
| 60 | PERSON | Blake |
| 60 | PERSON | Grof |
| 60 | PERSON | Ben Lomond |
| 60 | PERSON | Greer |
| 60 | PERSON | G.R. |
| 60 | PERSON | R. Tolbert |
| 60 | PERSON | Grof |
| 60 | LOCATION | Albany |
| 60 | PERSON | Metzner |
| 60 | PERSON | S. Adamson |
| 60 | PERSON | J. Holland |
| 60 | LOCATION | Rochester |
| 60 | PERSON | Mithoefer |
| 60 | PERSON | D. Hyman |
| 60 | PERSON | Berl |
| 60 | PERSON | Dumont |
| 60 | PERSON | Cami |
| 60 | PERSON | J Clin Psychopharmacol |
| 60 | PERSON | Dumont |
| 60 | PERSON | G.J. |
| 60 | PERSON | R.J. Verkes |
| 60 | PERSON | J Psychopharmacol |
| 60 | PERSON | C.M. |
| 60 | PERSON | Kirkpatrick |
| 60 | PERSON | Berl |
| 60 | PERSON | Kolbrich |
| 60 | PERSON | Liechti |
| 60 | PERSON | Mithoefer |
| 60 | LOCATION | M.C. |
| 60 | PERSON | J Psychopharmacol |
| 60 | PERSON | Tancer |
| 60 | PERSON | C.E. Johanson |
| 60 | PERSON | Berl |
| 61 | PERSON | Yubero-Lahoz |
| 61 | PERSON | Clin Pharmacokinet |
| 61 | PERSON | J Psychopharmacol |
| 61 | PERSON | Johanson |
| 61 | PERSON | K.P. |
| 61 | PERSON | J.G. Ramaekers |
| 61 | PERSON | Berl |
| 61 | PERSON | Grob |
| 61 | PERSON | Harris |
| 61 | PERSON | Berl |
| 61 | PERSON | Greer |
| 61 | PERSON | R. Tolbert |
| 61 | PERSON | Grob |
| 61 | LOCATION | C.S. |
| 61 | PERSON | Lester |
| 61 | PERSON | Ann Intern Med |
| 61 | PERSON | Peter Oehen |
| 61 | PERSON | Vollenweider |
| 61 | PERSON | Tancer |
| 61 | PERSON | C.E. Johanson |
| 61 | PERSON | Tancer |
| 61 | PERSON | C.E. Johanson |
| 61 | PERSON | Stolaroff |
| 61 | LOCATION | Sarasota |
| 61 | LOCATION | New York |
| 61 | PERSON | Blake |
| 61 | PERSON | Schnyder |
| 61 | PERSON | U. |
| 61 | PERSON | H. Moergeli |
| 61 | PERSON | Foa |
| 61 | LOCATION | E.B. |
| 61 | PERSON | Foa |
| 61 | LOCATION | E.B. |
| 61 | PERSON | Newman |
| 61 | PERSON | Brady |
| 62 | PERSON | Davidson |
| 62 | PERSON | Marshall |
| 62 | PERSON | R.D. |
| 62 | PERSON | Tucker |
| 62 | PERSON | Brunner |
| 62 | PERSON | S. Domhof |
| 62 | PERSON | F. Langer |
| 62 | LOCATION | New York |
| 62 | PERSON | Wiley |
| 62 | PERSON | Brunner |
| 62 | PERSON | U. Munzel |
| 62 | PERSON | Nichtparametrische Datenanalyse |
| 62 | LOCATION | Berlin |
| 62 | LOCATION | Los Angeles |
| 62 | PERSON | Marshall |
| 62 | PERSON | Weathers |
| 62 | LOCATION | Los Angeles |
| 62 | LOCATION | Krakow |
| 62 | PERSON | M. Hollifield |
| 62 | PERSON | Jama |
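
The pipeline above filters the entity table by type and takes the page number from the digits in the document identifier. To make those two steps easier to follow, here is a minimal base R sketch on a made-up entity table. The rows and the identifier format ("doc1", "doc13") are assumptions for illustration, not actual cleanNLP output.

```r
# Made-up entity table with the same columns used above; the id format "docN" is an assumption.
entity <- data.frame(
  id = c("doc13", "doc1", "doc13", "doc2"),
  entity_type = c("PERSON", "LOCATION", "DATE", "PERSON"),
  entity = c("Peter Oehen", "Santa Cruz", "2011", "Verena Widmer"),
  stringsAsFactors = FALSE
)

# Keep only the entity types of interest.
hits <- entity[entity$entity_type %in% c("PERSON", "LOCATION"), ]

# The first run of digits in the id is taken as the page number.
hits$page <- as.integer(regmatches(hits$id, regexpr("[0-9]+", hits$id)))

# Sort by page and keep the display columns.
hits <- hits[order(hits$page), c("page", "entity_type", "entity")]
```

The DATE row is dropped by the filter, and the remaining rows come out ordered by page, which mirrors what filter(), mutate() and arrange() do in the dplyr version.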

