Ken Ikeda
Open access to original clinical study reports is important for the development of health science. The challenge is to detect personal information efficiently in order to preserve the anonymity of study participants. Such information includes, but is not limited to, person names and addresses. Data such as email addresses and phone numbers can be detected easily from their string patterns using base R functions and R packages such as stringi and stringr. The harder cases are ADDRESS, NAME and ORGANIZATION, so I have been looking for methods to detect these in text.
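As an illustration of the pattern-based detection mentioned above, here is a minimal base R sketch. The sample text and both regular expressions are assumptions made for this example; they are deliberately simple and will miss many real-world formats.

```r
# Hypothetical sample text (not taken from the study report)
txt <- c("Contact: jane.doe@example.org or +1 (555) 010-9999.",
         "No contact details on this page.")

# Deliberately simple patterns; both are assumptions and will miss edge cases
email_pattern <- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"
phone_pattern <- "\\+?[0-9][0-9 ()-]{7,}[0-9]"

# gregexpr() locates every match; regmatches() extracts them,
# giving one character vector of matches per element of txt
emails <- regmatches(txt, gregexpr(email_pattern, txt, perl = TRUE))
phones <- regmatches(txt, gregexpr(phone_pattern, txt, perl = TRUE))

print(emails[[1]])
print(phones[[1]])
```

The same patterns could be passed to stringr::str_extract_all(); base R is used here only to keep the sketch free of dependencies.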
This is a trial of detecting ADDRESS and NAME in medical reports using the cleanNLP package with the Stanford CoreNLP engine. Stanford CoreNLP appeared to perform named entity recognition better than openNLP with “openNLPmodels.en”, but this remains to be studied. Note that when I ran the same script on a Windows 8 machine, an encoding error (use ASCII/UTF-8!) occurred at the annotation step, cnlp_annotate(). This did not happen in R 3.4.3 on macOS High Sierra.
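Since the error message asks for ASCII/UTF-8 input, one thing that might be worth checking on Windows is whether the text is valid UTF-8 before annotation. The sketch below shows such a check in base R on a hypothetical two-element vector; I have not verified that it resolves the error described above, so treat it as a diagnostic idea rather than a confirmed fix.

```r
# Hypothetical page text; the second element holds a Latin-1 encoded "u umlaut"
pages <- c("Plain ASCII page", "Z\xfcrich")
Encoding(pages[2]) <- "latin1"

# Before conversion, the raw bytes of the second element are not valid UTF-8
print(validUTF8(pages))

# enc2utf8() converts strings with a declared encoding to UTF-8
pages_utf8 <- enc2utf8(pages)

# After conversion, every element should pass the check
print(all(validUTF8(pages_utf8)))
```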
The following clinical study report was used: “http://www.maps.org/research-archive/mdma/MP-2_CSR_FINAL_15Sep11.pdf”
You need to run the following commands first:
library(markdown)
library(dplyr)
library(pdftools)
library(cleanNLP)
library(stringr)
library(knitr)
#install.packages("cleanNLP", dependencies = T)
#library(cleanNLP)
#cnlp_download_corenlp() # This took a few minutes
pdf_text() returns the report as a character vector, csr, with one element per page (62 pages).
You must call cnlp_init_corenlp() before annotating new text data.
Setting anno_level = 2 enables named-entity recognition; setting it to 3 or higher returns more annotations.
#Read a sample Clinical Study Report
csr <- pdf_text("http://www.maps.org/research-archive/mdma/MP-2_CSR_FINAL_15Sep11.pdf")
#cnlp_init_corenlp() must be run every time; the first call takes a few seconds, and subsequent calls are faster
library(rJava)
cnlp_init_corenlp(language = "en", anno_level = 2, mem = "6g")
#Annotate the Clinical Study Report
annotated <- cnlp_annotate(csr, as_strings = TRUE, backend = "coreNLP")
The available entity types are:
“LOCATION”, “PERSON”, “ORGANIZATION”, “MONEY”, “PERCENT”, “DATE”, and “TIME”.
annotated$entity %>%
filter(entity_type %in% c("PERSON", "LOCATION")) %>%
select(id, entity_type, entity) %>%
mutate(page=as.integer(str_extract(id, "\\d+"))) %>%
select(page, entity_type, entity) %>%
arrange(page) %>% kable(format="markdown")
page | entity_type | entity |
---|---|---|
1 | LOCATION | 309 Cedar Street |
1 | LOCATION | # 2323 Santa Cruz |
1 | PERSON | Rick Doblin |
1 | PERSON | Peter Oehen |
2 | PERSON | Peter Oehen |
2 | PERSON | Verena Widmer |
2 | PERSON | Rafael Traber |
2 | LOCATION | Switzerland |
4 | PERSON | Oehen |
10 | LOCATION | U.S. |
13 | LOCATION | Canton |
13 | LOCATION | Aargau |
13 | LOCATION | Solothurn |
13 | LOCATION | Switzerland |
13 | LOCATION | Helsinki |
13 | PERSON | Peter Oehen |
13 | PERSON | Oehen |
13 | PERSON | Verena Widmer |
13 | PERSON | Oehen |
13 | LOCATION | Switzerland |
13 | PERSON | Oehen |
13 | PERSON | Rafael Traber |
13 | PERSON | Peter Oehen |
13 | PERSON | Verena Widmer |
13 | PERSON | Rafael Traber |
13 | PERSON | Rudolf Brenneisen |
13 | PERSON | Michael Mithoefer |
13 | LOCATION | Ilsa Jerome |
13 | PERSON | Christoph Kopp |
13 | LOCATION | U.S. |
13 | LOCATION | Europe |
13 | LOCATION | Spain |
14 | LOCATION | U.S |
14 | PERSON | Stanislav Grof |
14 | PERSON | Holotropic Breathwork |
17 | PERSON | Oehen |
19 | LOCATION | Switzerland |
19 | PERSON | Franz Vollenweider |
19 | LOCATION | Switzerland |
19 | PERSON | Rudolf Brenneisen |
19 | LOCATION | Switzerland |
19 | PERSON | Brenneisen |
19 | PERSON | Laboratory Bichsel |
19 | LOCATION | Interlaken |
19 | LOCATION | Switzerland |
20 | PERSON | M. Collenberg |
20 | LOCATION | U.S. |
20 | PERSON | Charles Grob |
28 | PERSON | Wilcoxon |
34 | PERSON | Mauchly |
36 | PERSON | Mauchly |
37 | PERSON | Mauchly |
37 | PERSON | Mauchly |
37 | PERSON | Mauchly |
38 | LOCATION | Switzerland |
38 | PERSON | Bonferroni |
44 | PERSON | Low |
53 | PERSON | Morbus Meulengracht |
60 | PERSON | Weathers |
60 | PERSON | T.M. Keane |
60 | PERSON | J.R. Davidson |
60 | PERSON | Blake |
60 | PERSON | Grof |
60 | PERSON | Ben Lomond |
60 | PERSON | Greer |
60 | PERSON | G.R. |
60 | PERSON | R. Tolbert |
60 | PERSON | Grof |
60 | LOCATION | Albany |
60 | PERSON | Metzner |
60 | PERSON | S. Adamson |
60 | PERSON | J. Holland |
60 | LOCATION | Rochester |
60 | PERSON | Mithoefer |
60 | PERSON | D. Hyman |
60 | PERSON | Berl |
60 | PERSON | Dumont |
60 | PERSON | Cami |
60 | PERSON | J Clin Psychopharmacol |
60 | PERSON | Dumont |
60 | PERSON | G.J. |
60 | PERSON | R.J. Verkes |
60 | PERSON | J Psychopharmacol |
60 | PERSON | C.M. |
60 | PERSON | Kirkpatrick |
60 | PERSON | Berl |
60 | PERSON | Kolbrich |
60 | PERSON | Liechti |
60 | PERSON | Mithoefer |
60 | LOCATION | M.C. |
60 | PERSON | J Psychopharmacol |
60 | PERSON | Tancer |
60 | PERSON | C.E. Johanson |
60 | PERSON | Berl |
61 | PERSON | Yubero-Lahoz |
61 | PERSON | Clin Pharmacokinet |
61 | PERSON | J Psychopharmacol |
61 | PERSON | Johanson |
61 | PERSON | K.P. |
61 | PERSON | J.G. Ramaekers |
61 | PERSON | Berl |
61 | PERSON | Grob |
61 | PERSON | Harris |
61 | PERSON | Berl |
61 | PERSON | Greer |
61 | PERSON | R. Tolbert |
61 | PERSON | Grob |
61 | LOCATION | C.S. |
61 | PERSON | Lester |
61 | PERSON | Ann Intern Med |
61 | PERSON | Peter Oehen |
61 | PERSON | Vollenweider |
61 | PERSON | Tancer |
61 | PERSON | C.E. Johanson |
61 | PERSON | Tancer |
61 | PERSON | C.E. Johanson |
61 | PERSON | Stolaroff |
61 | LOCATION | Sarasota |
61 | LOCATION | New York |
61 | PERSON | Blake |
61 | PERSON | Schnyder |
61 | PERSON | U. |
61 | PERSON | H. Moergeli |
61 | PERSON | Foa |
61 | LOCATION | E.B. |
61 | PERSON | Foa |
61 | LOCATION | E.B. |
61 | PERSON | Newman |
61 | PERSON | Brady |
62 | PERSON | Davidson |
62 | PERSON | Marshall |
62 | PERSON | R.D. |
62 | PERSON | Tucker |
62 | PERSON | Brunner |
62 | PERSON | S. Domhof |
62 | PERSON | F. Langer |
62 | LOCATION | New York |
62 | PERSON | Wiley |
62 | PERSON | Brunner |
62 | PERSON | U. Munzel |
62 | PERSON | Nichtparametrische Datenanalyse |
62 | LOCATION | Berlin |
62 | LOCATION | Los Angeles |
62 | PERSON | Marshall |
62 | PERSON | Weathers |
62 | LOCATION | Los Angeles |
62 | LOCATION | Krakow |
62 | PERSON | M. Hollifield |
62 | PERSON | Jama |
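The table contains many repeated hits as well as names that do not belong to study staff, such as eponyms of statistical tests. A possible follow-up step, sketched below with dplyr, is to collapse repeats and drop known false positives. The hits data frame holds a few rows copied from the table above as a stand-in for the real annotation output, and the exclusion list not_people is a hypothetical example, not a vetted resource.

```r
library(dplyr)

# A few rows copied from the table above, standing in for the real output
hits <- data.frame(
  page = c(1L, 2L, 13L, 13L, 37L, 37L, 37L),
  entity_type = rep("PERSON", 7),
  entity = c("Peter Oehen", "Peter Oehen", "Peter Oehen", "Oehen",
             "Mauchly", "Mauchly", "Mauchly"),
  stringsAsFactors = FALSE
)

# Hypothetical exclusion list of statistical eponyms (an assumption)
not_people <- c("Mauchly", "Bonferroni", "Wilcoxon")

# One row per distinct entity, with the hit count and the pages it appears on
summary_tbl <- hits %>%
  filter(!entity %in% not_people) %>%
  group_by(entity_type, entity) %>%
  summarise(n = n(),
            pages = paste(sort(unique(page)), collapse = ", ")) %>%
  ungroup()

print(summary_tbl)
```

A short list like this is easier to review by hand than the full table, although surname-only hits such as “Oehen” would still need to be linked to the full name manually.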