Previous RPubs pages
- Chapter 1: Warm-up
- Chapter 2: Basics of UNIX commands
- Chapter 3: Regular expressions
- Chapter 4: Morphological analysis
- Chapter 5: Syntactic parsing
Continuing from the previous posts, I solve the NLP 100 knocks (2015 edition).
(The "Preface (about the NLP 100 knocks)" below is the same as in the previous posts.)
Preface (about the NLP 100 knocks)
- In this post, I solve the NLP 100 knocks (2015 edition), published by the Inui-Okazaki Laboratory at Tohoku University, in the R language.
- The pre-revision version of the NLP 100 knocks is also available on the laboratory's site.
Preface (about R)
- Please note in advance that there is no explanation of R syntax or functions.
- Instead of the string processing functions in {base}, this post uses {stringr} (1.0.0 or later) and pipes as much as possible ({stringi} is also used where it fits the task). Please note that some tasks are not well suited to a pipe-based style.
- In addition to the above, this time I use a package that calls Stanford CoreNLP via {rJava}.
Reference pages
{stringr} and {stringi}
hadley/stringr
RPubs - このパッケージがすごい2014: stringr
stringiで輝く☆テキストショリスト
stringr 1.0.0を使ってみる
{stringr}/{stringi}とbaseの文字列処理について
Stanford Core NLP
Stanford CoreNLP
stanfordnlp/CoreNLP
Stanford CoreNLPのannotator一覧
CoreNLPのダウンロードから利用のメモ
{rJava}
rJava - Low-level R to Java interface
s-u/rJava
Getting R and Java 1.8 to work together on OSX
Feedback
- Suggestions for improvements, handy functions I missed, mistakes, and any other comments are very welcome.
- Reports through either of the channels below are much appreciated (note that I am not very used to Git).
Twitter, GitHub
When R CMD javareconf refers to "/System/Library/" instead of "/Library/Java/"
# Several Java versions coexist on this Mac
$ /usr/libexec/java_home -V
Matching Java Virtual Machines (4):
1.8.0_51, x86_64: "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home
1.7.0_79, x86_64: "Java SE 7" /Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home
1.6.0_65-b14-466.1, x86_64: "Java SE 6" /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
1.6.0_65-b14-466.1, i386: "Java SE 6" /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home
# Check which Java version {rJava} is referencing
# It should return "1.8.*", not "1.6.*"
library(rJava)
.jinit()
.jcall("java/lang/System", "S", "getProperty", "java.runtime.version")
# "1.8.*"でないと{rJava}の実行はできても、{StanfordCoreNLP}や{coreNLP}で文字列を与えると"Unsupported major.minor version 52.0 error"が出る
library(StanfordCoreNLP)
x <- paste("Stanford University is located in California.", "It is a great university.")
p <- StanfordCoreNLP::StanfordCoreNLP_Pipeline(NULL)
p(s = x)
# The error occurs here
Append to .bash_profile
$ export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
– Check
$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home
$ java -version
java version "1.8.0_51"
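The same check can be done from inside R. This is only a sketch for sessions that might not inherit .bash_profile (for example an R GUI launched from the Finder); the commented-out Sys.setenv() fallback reuses the JDK path shown above.
# Check which JAVA_HOME the R session sees (sketch; the value depends on how R was launched)
Sys.getenv("JAVA_HOME")
# Fallback if the variable comes back empty (path taken from the java_home output above)
# Sys.setenv(JAVA_HOME = "/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home")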
"JAVA_LD_LIBRARY_PATH" probably does not need to be set
$ echo $JAVA_LD_LIBRARY_PATH
/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre/lib/server/
Setting the "JNI cpp flags" and "JNI linker flags" arguments of R CMD javareconf to the path of the manually installed Java (under "/Library/Java") did not work, so I used homebrew to update the "CurrentJDK" symbolic link in the referenced location (under "/System/Library") so that it points to the 1.8 series (I saw a note saying "the symlink is created for the most recently installed version", so simply reinstalling the JDK, without homebrew, might also be enough). If the "CurrentJDK" symlink still points to the 1.6 series at this point, it will not work.
$ ls -l /System/Library/Frameworks/JavaVM.framework/Versions/
total 64
lrwxr-xr-x 1 root wheel 10 11 30 2014 1.4 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 11 30 2014 1.4.2 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 11 30 2014 1.5 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 11 30 2014 1.5.0 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 11 30 2014 1.6 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 11 30 2014 1.6.0 -> CurrentJDK
drwxr-xr-x 8 root wheel 272 11 30 2014 A
lrwxr-xr-x 1 root wheel 1 11 30 2014 Current -> A
lrwxr-xr-x 1 root wheel 58 7 20 06:29 CurrentJDK -> /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents
A warning is generated, but the reference that {rJava} picks up is now the 1.8-series symlink
$ R CMD javareconf
Java interpreter : /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre/bin/java
Java version : 1.8.0_51
Java home path : /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home
Java compiler : /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/bin/javac
Java headers gen.: /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/bin/javah
Java archive tool: /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/bin/jar
System Java on OS X
trying to compile and link a JNI program
detected JNI cpp flags : -I/System/Library/Frameworks/JavaVM.framework/Headers
detected JNI linker flags : -framework JavaVM
clang -I/Library/Frameworks/R.framework/Resources/include -I/System/Library/Frameworks/JavaVM.framework/Headers -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -fPIC -Wall -mtune=core2 -g -O2 -c conftest.c -o conftest.o
conftest.c:4:5: warning: 'JNI_CreateJavaVM' is deprecated [-Wdeprecated-declarations]
JNI_CreateJavaVM(0, 0, 0);
^
/System/Library/Frameworks/JavaVM.framework/Headers/jni.h:1937:1: note: 'JNI_CreateJavaVM' has been explicitly marked
deprecated here
JNI_CreateJavaVM(JavaVM **pvm, void **penv, void *args);
^
1 warning generated.
clang -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o conftest.so conftest.o -framework JavaVM -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
JAVA_HOME : /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home
Java library path: /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre/lib/server/
JNI cpp flags : -I/System/Library/Frameworks/JavaVM.framework/Headers
JNI linker flags : -framework JavaVM
Updating Java configuration in /Library/Frameworks/R.framework/Resources
Done.
# devtools::install_github("statsmaths/coreNLP")
# {coreNLP} needs the Stanford CoreNLP core files, downloaded from the official site, to be in place beforehand
# They can be fetched with the function below, although its internals are fairly hard-coded
# coreNLP::downloadCoreNLP(type = "base")
SET_LOAD_LIB <- c("knitr", "readr", "dplyr", "stringr", "stringi", "lazyeval", "SnowballC", "coreNLP", "xml2", "DiagrammeR")
sapply(X = SET_LOAD_LIB, FUN = library, character.only = TRUE, logical.return = TRUE)
## knitr readr dplyr stringr stringi lazyeval
## TRUE TRUE TRUE TRUE TRUE TRUE
## SnowballC coreNLP xml2 DiagrammeR
## TRUE TRUE TRUE TRUE
knitr::opts_chunk$set(comment = NA)
# Input data URL for Chapter 6 (fixed)
TASK_INPUT_URL <- "http://www.cl.ecei.tohoku.ac.jp/nlp100/data/nlp.txt"
# Fetch the file
download.file(
url = TASK_INPUT_URL, destfile = basename(TASK_INPUT_URL),
method = "wget", quiet = FALSE
)
if (file.exists(file = basename(TASK_INPUT_URL))) {
INPUT_TEXT <- readr::read_lines(file = basename(TASK_INPUT_URL), n_max = -1)
} else{
stop("File not found.")
}
Regarding the pattern (. or ; or : or ? or !) → whitespace → uppercase English letter as a sentence boundary, output the input document in one-sentence-per-line format.
SET_SENTENCE_BREAK_PATTERN <- "([.;:?!])[:blank:]([A-Z])"
input_sentence <- stringr::str_replace_all(
string = INPUT_TEXT,
pattern = SET_SENTENCE_BREAK_PATTERN, replacement = "\\1\n\\2"
) %>%
stringr::str_split(pattern = "\n") %>%
dplyr::combine() %>%
dplyr::data_frame(sentence = .) %>%
print
Source: local data frame [69 x 1]
sentence
1 Natural language processing
2 From Wikipedia, the free encyclopedia
3
4 Natural language processing (NLP) is a field of computer science, artificia
5 As such, NLP is related to the area of humani-computer interaction.
6 Many challenges in NLP involve natural language understanding, that is, ena
7
8 History
9
10 The history of NLP generally starts in the 1950s, although work can be foun
.. ...
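As a quick sanity check, the same break pattern can be tried on a hand-written toy string (the string below is made up for illustration and is not part of the task data):
# Only ". " followed by an uppercase letter becomes a break; ". it" (lowercase) is left alone
stringr::str_replace_all(
  string = "NLP is fun. it is also hard. Really hard! See?",
  pattern = SET_SENTENCE_BREAK_PATTERN,
  replacement = "\\1\n\\2"
) %>%
  stringr::str_split(pattern = "\n")
# should give "NLP is fun. it is also hard.", "Really hard!", "See?"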
Regarding whitespace as word boundaries, take the output of problem 50 as input and output it in one-word-per-line format. However, output a blank line at the end of each sentence.
# stringr::str_split consumes the string matched by the pattern argument, so a fixed marker is appended to the end of each sentence and used as the split point (in order to "output a blank line at the end of each sentence")
SET_SENTENCE_END_SYMBOL <- "EOS"
input_word <- stringr::str_split(
string = stringr::str_c(input_sentence$sentence, SET_SENTENCE_END_SYMBOL, sep = " "),
pattern = stringr::boundary(type = "word", skip_word_none = TRUE)
) %>%
dplyr::combine() %>%
stringr::str_replace_all(pattern = SET_SENTENCE_END_SYMBOL, replacement = "") %>%
dplyr::data_frame(word = .) %>%
print
Source: local data frame [1,357 x 1]
word
1 Natural
2 language
3 processing
4
5 From
6 Wikipedia
7 the
8 free
9 encyclopedia
10
.. ...
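The comment about str_split above is easier to see on a tiny toy input: splitting on word boundaries alone leaves no trace of where a sentence ended, which is why the marker is appended first. The two toy strings below are made up for illustration.
# Without the marker, the two sentences simply run together as four words
stringr::str_split(
  string = c("First sentence.", "Second one."),
  pattern = stringr::boundary(type = "word", skip_word_none = TRUE)
) %>%
  dplyr::combine()
# With the marker, an "EOS" element survives the split and can later be replaced by ""
stringr::str_split(
  string = stringr::str_c(c("First sentence.", "Second one."), "EOS", sep = " "),
  pattern = stringr::boundary(type = "word", skip_word_none = TRUE)
) %>%
  dplyr::combine()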
Take the output of problem 51 as input, apply Porter's stemming algorithm, and output the word and its stem in tab-separated format. In Python, the stemming module is a good implementation of Porter's stemming algorithm.
SET_SEP_STRING <- "\t"
# Use the "porter" stemmer available in {SnowballC}
SnowballC::getStemLanguages()
[1] "danish" "dutch" "english" "finnish" "french"
[6] "german" "hungarian" "italian" "norwegian" "porter"
[11] "portuguese" "romanian" "russian" "spanish" "swedish"
[16] "turkish"
input_word %>%
dplyr::mutate(stem = SnowballC::wordStem(words = .$word, language = "porter")) %>%
dplyr::mutate(
word_stem = ifelse(
test = word != "",
yes = stringr::str_c(as.character(.$word), as.character(.$stem), sep = SET_SEP_STRING),
no = .$word
)
)
Source: local data frame [1,357 x 3]
word stem word_stem
1 Natural Natur Natural\tNatur
2 language languag language\tlanguag
3 processing process processing\tprocess
4
5 From From From\tFrom
6 Wikipedia Wikipedia Wikipedia\tWikipedia
7 the the the\tthe
8 free free free\tfree
9 encyclopedia encyclopedia encyclopedia\tencyclopedia
10
.. ... ... ...
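For reference, the stemmer can also be called directly on a few words taken from the table above; the stems match the first rows of that table.
# Direct call to the Porter stemmer used in the pipeline above
SnowballC::wordStem(words = c("Natural", "language", "processing"), language = "porter")
## [1] "Natur"   "languag" "process"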
Using Stanford Core NLP, obtain the analysis result of the input text in XML format. Then read this XML file and output the input text in one-word-per-line format.
SET_USE_ANNOTATORS <- c("tokenize", "ssplit", "pos", "lemma", "ner", "parse", "depparse", "dcoref")
# This time, {coreNLP} is used to receive the Stanford Core NLP results
coreNLP::initCoreNLP(
libLoc = stringr::str_c(
system.file("extdata", package = "coreNLP"), "/stanford-corenlp-full-2015-04-20"
),
mem = "3g",
annotators = SET_USE_ANNOTATORS
)
# Parse the XML and extract only the words
xml2::read_xml(
x = coreNLP::annotateString(text = input_sentence$sentence, format = "xml"),
encoding = "UTF-8"
) %>%
xml2::xml_find_all(".//word") %>%
xml2::xml_text(trim = TRUE) %>%
dplyr::data_frame(word = .)
Source: local data frame [1,452 x 1]
word
1 Natural
2 language
3 processing
4 From
5 Wikipedia
6 ,
7 the
8 free
9 encyclopedia
10 Natural
.. ...
# It is easier to set format = "obj" and extract only the tokens
anno_obj <- coreNLP::annotateString(text = input_sentence$sentence, format = "obj")
anno_obj$token %>%
dplyr::select(token) %>%
dplyr::as_data_frame()
Source: local data frame [1,452 x 1]
token
1 Natural
2 language
3 processing
4 From
5 Wikipedia
6 ,
7 the
8 free
9 encyclopedia
10 Natural
.. ...
Read the Stanford Core NLP analysis result XML and output the word, lemma, and part of speech in tab-separated format.
# Use the annotation object
anno_obj$token %>%
dplyr::mutate(pos_tagging = stringr::str_c(token, lemma, POS, sep = SET_SEP_STRING)) %>%
dplyr::select(pos_tagging) %>%
dplyr::as_data_frame()
Source: local data frame [1,452 x 1]
pos_tagging
1 Natural\tnatural\tJJ
2 language\tlanguage\tNN
3 processing\tprocessing\tNN
4 From\tfrom\tIN
5 Wikipedia\tWikipedia\tNNP
6 ,\t,\t,
7 the\tthe\tDT
8 free\tfree\tJJ
9 encyclopedia\tencyclopedia\tNN
10 Natural\tnatural\tJJ
.. ...
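The task statement talks about reading the XML, so here is a sketch of that route as well. It re-parses the same annotateString() output as in the previous problem; the XPath expressions (token/word, token/lemma, token/POS) assume the Stanford CoreNLP XML layout, and zipping the three node sets relies on every token carrying all three children, which should hold when the pos and lemma annotators are enabled.
# Sketch: build the same tab-separated output directly from the XML
tokens_doc <- xml2::read_xml(
  x = coreNLP::annotateString(text = input_sentence$sentence, format = "xml"),
  encoding = "UTF-8"
)
dplyr::data_frame(
  pos_tagging = stringr::str_c(
    tokens_doc %>% xml2::xml_find_all(".//token/word") %>% xml2::xml_text(trim = TRUE),
    tokens_doc %>% xml2::xml_find_all(".//token/lemma") %>% xml2::xml_text(trim = TRUE),
    tokens_doc %>% xml2::xml_find_all(".//token/POS") %>% xml2::xml_text(trim = TRUE),
    sep = SET_SEP_STRING
  )
)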
Extract all person names from the input text.
# This heuristic still lets through some words that are not person names
anno_obj$token %>%
dplyr::filter(
POS == "NNP" &
!is.element(NER, c("ORGANIZATION", "MISC", "LOCATION")) &
Speaker != "which" &
token != stringr::str_to_upper(token) &
!stringr::str_detect(string = token, pattern = "^[A-Z].*?[A-Z]")
) %>%
dplyr::select(-CharacterOffsetBegin, -CharacterOffsetEnd, -Speaker)
sentence id token lemma POS NER
1 5 4 Alan Alan NNP PERSON
2 5 5 Turing Turing NNP PERSON
3 10 40 Joseph Joseph NNP PERSON
4 10 41 Weizenbaum Weizenbaum NNP PERSON
5 15 5 Schank Schank NNP PERSON
6 15 12 Cullingford Cullingford NNP O
7 15 19 Wilensky Wilensky NNP PERSON
8 15 26 Meehan Meehan NNP PERSON
9 15 33 Lehnert Lehnert NNP PERSON
10 15 40 Carbonell Carbonell NNP PERSON
11 15 49 Lehnert Lehnert NNP PERSON
12 16 12 Racter Racter NNP O
13 16 15 Jabberwacky Jabberwacky NNP PERSON
14 19 14 Moore Moore NNP PERSON
15 21 11 Hidden Hidden NNP O
16 21 12 Markov Markov NNP O
17 32 5 Modern Modern NNP O
18 43 1 Automatic Automatic NNP O
19 49 14 Learning Learning NNP O
20 50 28 Computational Computational NNP O
# Filtering on NER == "PERSON" alone misses some names (e.g., Cullingford)
anno_obj$token %>%
dplyr::filter(NER == "PERSON") %>%
dplyr::select(-CharacterOffsetBegin, -CharacterOffsetEnd, -Speaker)
sentence id token lemma POS NER
1 5 4 Alan Alan NNP PERSON
2 5 5 Turing Turing NNP PERSON
3 10 40 Joseph Joseph NNP PERSON
4 10 41 Weizenbaum Weizenbaum NNP PERSON
5 15 3 MARGIE MARGIE NNP PERSON
6 15 5 Schank Schank NNP PERSON
7 15 19 Wilensky Wilensky NNP PERSON
8 15 26 Meehan Meehan NNP PERSON
9 15 33 Lehnert Lehnert NNP PERSON
10 15 40 Carbonell Carbonell NNP PERSON
11 15 49 Lehnert Lehnert NNP PERSON
12 16 15 Jabberwacky Jabberwacky NNP PERSON
13 19 14 Moore Moore NNP PERSON
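Both tables list person-name tokens one row at a time. As a small extension, consecutive PERSON tokens can be joined into full names; the grouping rule below (adjacent token ids within the same sentence belong to one name) is my own assumption, not something required by the task.
# Sketch: join runs of consecutive NER == "PERSON" tokens into full names
anno_obj$token %>%
  dplyr::filter(NER == "PERSON") %>%
  dplyr::group_by(sentence) %>%
  dplyr::mutate(name_id = cumsum(c(1, diff(as.integer(id)) != 1))) %>%
  dplyr::group_by(sentence, name_id) %>%
  dplyr::summarise(person = stringr::str_c(token, collapse = " "))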
Based on the results of Stanford Core NLP's coreference resolution, replace each mention in the text with its representative mention. When replacing, write it as "representative mention (mention)" so that the original mention remains recognizable.
# Unless the text is processed one sentence at a time, the coreference chains span too widely and the behavior seems off
coref_res <- do.call(
"rbind",
lapply(X = seq(from = 1, to = length(input_sentence$sentence)), function (i) {
input_xml <- xml2::read_xml(
x = coreNLP::annotateString(text = input_sentence$sentence[i], format = "xml"),
encoding = "UTF-8"
)
sentence <- input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/sentence") %>%
xml2::xml_text(trim = TRUE)
return(
dplyr::data_frame(
sentence = numeric(length = length(sentence)) + i - 1,
start = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/start") %>%
xml2::xml_text(trim = TRUE),
end = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/end") %>%
xml2::xml_text(trim = TRUE),
head = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/head") %>%
xml2::xml_text(trim = TRUE),
text = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/text") %>%
xml2::xml_text(trim = TRUE),
is_representative = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention") %>%
xml2::xml_attr(attr = "representative")
)
)
})
)
# For comparison
input_xml <- xml2::read_xml(
x = coreNLP::annotateString(text = input_sentence$sentence, format = "xml"),
encoding = "UTF-8"
)
all_coref_res <- dplyr::left_join(
x = dplyr::data_frame(
sentence = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/sentence") %>%
xml2::xml_text(trim = TRUE),
start = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/start") %>%
xml2::xml_text(trim = TRUE),
end = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/end") %>%
xml2::xml_text(trim = TRUE),
head = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/head") %>%
xml2::xml_text(trim = TRUE),
text = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention/text") %>%
xml2::xml_text(trim = TRUE),
is_representative = input_xml %>%
xml2::xml_find_all(xpath = ".//coreference/coreference/mention") %>%
xml2::xml_attr(attr = "representative")
),
y = anno_obj$coref,
by = c("sentence", "start", "end", "head", "text")
)
coref_res %>%
dplyr::filter(sentence == 1) %>%
as.data.frame
sentence start end head text is_representative
1 1 2 3 2 Wikipedia true
2 1 4 7 6 the free encyclopedia <NA>
all_coref_res %>%
dplyr::filter(sentence == 1) %>%
as.data.frame
sentence start end head
1 1 7 16 12
2 1 17 22 18
3 1 33 34 33
text
1 the free encyclopedia Natural language processing -LRB- NLP -RRB-
2 a field of computer science
3 computers
is_representative corefId
1 true 1
2 <NA> 1
3 true 2
# Solve it with the one-sentence-at-a-time results
# coreNLP::getCoreference() does not provide the representative flag
token_coref_res <- dplyr::left_join(
x = do.call(
"rbind",
lapply(X = seq(from = 1, to = length(input_sentence$sentence)), function (i) {
token <- coreNLP::annotateString(text = input_sentence$sentence[i], format = "obj")$token
if(!is.null(token)) {
token$sentence <- i
}
return(token)
})
),
y = coref_res %>%
dplyr::group_by(sentence) %>%
dplyr::mutate(replace_text = dplyr::lag(x = text, n = 1)) %>%
dplyr::ungroup(.) %>%
dplyr::mutate(
sentence = sentence + 1,
end = as.integer(end) - 1
),
by = c("sentence", "id" = "start")
)
# FIX: remove the "for" loop
representative_res <- lapply(
X = unique(token_coref_res$sentence[!is.na(token_coref_res$replace_text)]),
FUN = function (rep_i) {
each_token_coref_res <- token_coref_res %>%
dplyr::filter(sentence == rep_i)
merge_str <- c()
replace_flag <- 0
cut_end_idx <- NULL
for(tid in as.integer(each_token_coref_res$id)) {
if (!is.na(each_token_coref_res$replace_text[tid])) {
merge_str <- append(
merge_str,
stringr::str_c(
each_token_coref_res$replace_text[tid],
"(", each_token_coref_res$text[tid], ")"
)
)
replace_flag <- 1
cut_end_idx <- as.integer(each_token_coref_res$end[tid])
} else if (replace_flag == 0) {
merge_str <- append(merge_str, each_token_coref_res$token[tid])
} else {
if (tid > cut_end_idx) {
replace_flag <- 0
merge_str <- append(merge_str, each_token_coref_res$token[tid])
}
}
}
return(stringr::str_c(merge_str, collapse = " "))
}
) %>%
dplyr::combine() %>%
print
[1] "From Wikipedia , Wikipedia(the free encyclopedia)"
[2] "In 1950 , Alan Turing published an article titled `` Computing Machinery and Intelligence '' which proposed what is now called the Alan Turing(Turing) test as a criterion of intelligence ."
[3] "The authors claimed that within three or five years , a solved problem(machine translation) would be a solved problem ."
[4] "Some notably successful NLP systems developed in the 1960s were SHRDLU , SHRDLU(a natural language system working in restricted `` blocks worlds '' with restricted vocabularies) , and ELIZA , a simulation of a Rogerian psychotherapist , written by Joseph Weizenbaum between 1964 to 1966 ."
[5] "When the `` patient '' exceeded the very small knowledge base , ELIZA might provide a generic response , for example , responding to `` the `` patient ''(My) the `` patient ''(My) My(My head) head hurts '' with `` Why do your head(you) say say My head(your head) you(your) head hurts When"
[6] "Examples are MARGIE -LRB- Schank , 1975 -RRB- , SAM -LRB- Cullingford , Lehnert(1978) -RRB- , PAM -LRB- Wilensky , 1978(1978) -RRB- , TaleSpin -LRB- Meehan , 1976 -RRB- , QUALM -LRB- Lehnert 1981(Lehnert) , 1977 -RRB- , Politics -LRB- Carbonell , 1979 -RRB- , and Plot Units -LRB- Lehnert 1981 -RRB- ."
[7] "The cache language models upon which many speech recognition systems now rely are The cache language models upon which many speech recognition systems now rely(examples of such statistical models) ."
[8] "However , most other systems depended on corpora specifically developed for the tasks implemented by these these these systems , which was -LRB- and often continues to be -RRB-(these systems) systems , which was -LRB- and often continues to be -RRB- a major limitation in the success of these systems(these systems) systems"
[9] "a set of documents -LRB- or sometimes , individual sentences -RRB- that have been hand-annotated with the correct values to be learned(A corpus -LRB- plural , `` corpora '' -RRB-) is a set of documents -LRB- or sometimes , individual sentences -RRB- that have been hand-annotated with the correct values to be learned ."
[10] "Generally , handling such input gracefully with hand-written rules -- or more generally , creating systems of hand-written rules(hand-written rules) that make soft decisions -- extremely difficult , error-prone and time-consuming ."
[11] "However , systems based on hand-written rules can only be made more accurate by increasing the complexity of hand-written rules(the rules) , which is a much more difficult task ."
[12] "The subfield of NLP devoted to learning approaches is known as Natural Language Learning -LRB- NLL -RRB- and Natural Language Learning -LRB- NLL -RRB-(its) conference CoNLL and peak body SIGNLL are sponsored by ACL , recognizing also their links with Computational Linguistics and Language Acquisition ."
Visualize the result of Stanford Core NLP's dependency parsing (collapsed-dependencies) as a directed graph. For visualization, converting the dependency tree into the DOT language and using Graphviz works well. To visualize directed graphs directly from Python, pydot can be used.
# Reusing the approach from problem 44, "Visualizing dependency trees"
createDiagrammeGraph <- function (
from, to,
DIAGRAM_GRAPH_PARAM
) {
nodes <- DiagrammeR::create_nodes(
nodes = unique(c(from, to)),
label = DIAGRAM_GRAPH_PARAM$NODE$LABEL,
style = DIAGRAM_GRAPH_PARAM$NODE$STYLE, color = DIAGRAM_GRAPH_PARAM$NODE$COLOR,
shape = rep(DIAGRAM_GRAPH_PARAM$NODE$SHAPE, length = length(unique(c(from, to))))
)
edges <- DiagrammeR::create_edges(
from = from, to = to,
relationship = DIAGRAM_GRAPH_PARAM$EDGE$RELATIONSHIP
)
return(
DiagrammeR::create_graph(
nodes_df = nodes, edges_df = edges,
node_attrs = DIAGRAM_GRAPH_PARAM$ATTRS$NODE,
edge_attrs = DIAGRAM_GRAPH_PARAM$ATTRS$EDGE,
directed = DIAGRAM_GRAPH_PARAM$ATTRS$IS_DIRECTED,
graph_name = DIAGRAM_GRAPH_PARAM$ATTRS$GRAPH_NAME
)
)
}
SET_DIAGRAM_GRAPH_PARAM <- list(
NODE = list(
LABEL = TRUE, STYLE = "filled", COLOR = "white", SHAPE = "rectangle"
),
EDGE = list(
RELATIONSHIP = "connected_to"
),
ATTRS = list(
NODE = "fontname = Helvetica", EDGE = c("color = gray", "arrowsize = 1"),
IS_DIRECTED = TRUE,
GRAPH_NAME = "dependency_tree"
)
)
SET_PLOT_SENTENCE_ID <- 1
plot_dependencies <- coreNLP::getDependency(annotation = anno_obj, type = "collapsed") %>%
dplyr::filter(sentence == SET_PLOT_SENTENCE_ID) %>%
dplyr::filter(!stringr::str_detect(string = .$type, pattern = "^root$|^punct$"))
# Visualize with the parent-child direction reversed
diagramme_graph <- createDiagrammeGraph(
from = plot_dependencies$dependent,
to = plot_dependencies$governor,
DIAGRAM_GRAPH_PARAM = SET_DIAGRAM_GRAPH_PARAM
)
DiagrammeR::render_graph(graph = diagramme_graph, output = "graph")
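The task statement also suggests going through the DOT language and Graphviz. With {DiagrammeR} that route is available directly via grViz(); the two edges below are written by hand just to illustrate the DOT form of the reversed dependent -> governor edges for "Natural language processing", they are not generated from the parse result.
# Hand-written DOT sketch rendered with Graphviz through DiagrammeR::grViz()
DiagrammeR::grViz("
digraph dependency_tree {
  node [shape = rectangle, style = filled, color = white, fontname = Helvetica]
  Natural  -> processing
  language -> processing
}
")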
Based on the result of Stanford Core NLP's dependency parsing (collapsed-dependencies), output tuples of "subject predicate object" in tab-separated format. Use the following definitions of subject, predicate, and object:
- Predicate: a word that has children (dependents) in both nsubj and dobj relations
- Subject: the child (dependent) of the predicate in the nsubj relation
- Object: the child (dependent) of the predicate in the dobj relation
# nsubj([subject]) -> [predicate] -> dobj([object])
include_nsub_dobj <- coreNLP::getDependency(annotation = anno_obj, type = "collapsed") %>%
dplyr::filter(stringr::str_detect(string = .$type, pattern = "^nsubj$|^dobj$"))
dplyr::left_join(
x = include_nsub_dobj %>%
dplyr::select(-dependentIdx, -govIndex, -depIndex) %>%
dplyr::rename(g_governor = governor, g_dependent = dependent, g_type = type),
y = include_nsub_dobj %>%
dplyr::select(-govIndex, -depIndex, -governorIdx) %>%
dplyr::rename(d_governor = governor, d_dependent = dependent, d_type = type),
by = c("sentence", "g_governor" = "d_governor")
) %>%
dplyr::filter(.$g_type == "nsubj" & .$d_type == "dobj") %>%
dplyr::select(sentence, subj = g_dependent, predicate = g_governor, obj = d_dependent) %>%
dplyr::as_data_frame()
Source: local data frame [18 x 4]
sentence subj predicate obj
1 3 challenges involve generation
2 3 understanding involve generation
3 3 that enabling computers
4 5 Turing published article
5 11 ELIZA provided interaction
6 12 patient exceeded base
7 12 ELIZA provide response
8 14 which structured information
9 19 underpinnings discouraged sort
10 19 that underlies approach
11 23 that contains errors
12 34 implementations involved coding
13 38 algorithms take input
14 41 models have advantage
15 41 they express certainty
16 42 Systems have advantages
17 43 procedures make use
18 44 that make decisions
Read the result of Stanford Core NLP's phrase structure parsing (S-expressions) and display all noun phrases (NP) in the text. Nested noun phrases must also be displayed.
# For each target "(NP", extract the span from that "(NP"
# up to paren_lower_idx + (position of the target "(NP" - 1),
# where paren_lower_idx is the first position after "(NP" at which the parenthesis depth drops below the depth at "(NP"
parseSExpression <- function (
parsed_sentence,
target_pattern = "\\(NP"
) {
each_sentence <- stringr::str_split(
string = stringr::str_replace_all(
string = parsed_sentence, pattern = "\\(ROOT\n|\\(S\n", replacement = ""
),
pattern = stringr::boundary(type = "sentence")
) %>%
dplyr::combine() %>%
stringr::str_trim(side = "both") %>%
stringi::stri_flatten(collapse = " ") %>%
stringr::str_split(pattern = " ") %>%
dplyr::combine()
paren_counter <- (
each_sentence %>%
stringr::str_count(pattern = "\\(") %>%
cumsum
) -
(each_sentence %>%
stringr::str_count(pattern = "\\)") %>%
cumsum
)
np_start_idx <- each_sentence %>%
stringr::str_detect(pattern = target_pattern) %>%
which
index_counter <- 1
return(
sapply(seq(from = 1, to = length(np_start_idx)), function (index_counter) {
target_np_backward_sequence <- seq(
from = np_start_idx[index_counter], to = length(paren_counter)
)
paren_lower_idx <- which.max(
paren_counter[np_start_idx[index_counter]] > paren_counter[target_np_backward_sequence]
)
np_range <- seq(
from = np_start_idx[index_counter],
to = paren_lower_idx + (np_start_idx[index_counter] - 1)
)
return(stringr::str_c(each_sentence[np_range], collapse = " "))
})
)
}
SET_PARSE_PHRASE_STRUCTURE_SENTENCE_ID <- 1
parseSExpression(
parsed_sentence = anno_obj$parse[SET_PARSE_PHRASE_STRUCTURE_SENTENCE_ID],
target_pattern = "\\(NP"
)
[1] "(NP (JJ Natural) (NN language) (NN processing))"
[2] "(NP (NNP Wikipedia)))"
[3] "(NP (NP (DT the) (JJ free) (NN encyclopedia) (JJ Natural) (NN language) (NN processing)) (PRN (-LRB- -LRB-) (NP (NN NLP)) (-RRB- -RRB-)))"
[4] "(NP (DT the) (JJ free) (NN encyclopedia) (JJ Natural) (NN language) (NN processing))"
[5] "(NP (NN NLP))"
[6] "(NP (NP (NP (DT a) (NN field)) (PP (IN of) (NP (NN computer) (NN science)))) (, ,) (NP (JJ artificial) (NN intelligence)) (, ,) (CC and) (NP (NP (NNS linguistics)) (VP (VBN concerned) (PP (IN with) (NP (NP (DT the) (NNS interactions)) (PP (IN between) (NP (NP (NNS computers)) (CC and) (NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages))))))))))"
[7] "(NP (NP (DT a) (NN field)) (PP (IN of) (NP (NN computer) (NN science))))"
[8] "(NP (DT a) (NN field))"
[9] "(NP (NN computer) (NN science))))"
[10] "(NP (JJ artificial) (NN intelligence))"
[11] "(NP (NP (NNS linguistics)) (VP (VBN concerned) (PP (IN with) (NP (NP (DT the) (NNS interactions)) (PP (IN between) (NP (NP (NNS computers)) (CC and) (NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages))))))))))"
[12] "(NP (NNS linguistics))"
[13] "(NP (NP (DT the) (NNS interactions)) (PP (IN between) (NP (NP (NNS computers)) (CC and) (NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages))))))))))"
[14] "(NP (DT the) (NNS interactions))"
[15] "(NP (NP (NNS computers)) (CC and) (NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages))))))))))"
[16] "(NP (NNS computers))"
[17] "(NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages))))))))))"
I worked through the "Processing English text" chapter of the NLP 100 knocks (2015 edition).
Stanford CoreNLP is an "all-in-one" package for language processing, and it has features I did not touch this time (such as sentiment analysis), so there is a lot to play with. If you are interested, do give it a try.
– 自然言語処理における「全部入り」パッケージ
This time I call Stanford CoreNLP through {coreNLP}, but there is also a package called {StanfordCoreNLP} (it, too, uses {rJava}).
– (It can be installed with the code below)
– install.packages("StanfordCoreNLP", repos = "http://datacube.wu.ac.at", type = "source")
– NLP packages (model files) are published on the site specified by the repos argument.
– http://datacube.wu.ac.at
English text can also be processed with {NLP} and {openNLP}, which build on the output of the Apache OpenNLP project, and I would like to try solving these tasks with them as well (since the {coreNLP} used this time was not very convenient, I am also considering calling Stanford CoreNLP myself from {rJava}).
Personally, setting up the {rJava} environment was a much bigger struggle than solving the problems themselves.
library(devtools)
devtools::session_info()
Session info --------------------------------------------------------------
setting value
version R version 3.2.0 (2015-04-16)
system x86_64, darwin13.4.0
ui X11
language (EN)
collate ja_JP.UTF-8
tz Asia/Tokyo
Packages ------------------------------------------------------------------
package     * version     date       source
assertthat  * 0.1         2013-12-06 CRAN (R 3.2.0)
coreNLP       0.4-1       2015-07-19 Github (statsmaths/coreNLP@3a667c6)
DBI         * 0.3.1       2014-09-24 CRAN (R 3.2.0)
devtools      1.7.0       2015-01-17 CRAN (R 3.2.0)
DiagrammeR    0.7         2015-07-03 Github (rich-iannone/DiagrammeR@de3c120)
digest      * 0.6.8       2014-12-31 CRAN (R 3.2.0)
dplyr         0.4.2.9000  2015-06-17 Github (hadley/dplyr@7763150)
evaluate    * 0.7         2015-04-21 CRAN (R 3.2.0)
formatR     * 1.2         2015-04-21 CRAN (R 3.2.0)
htmltools   * 0.2.6       2014-09-08 CRAN (R 3.2.0)
htmlwidgets * 0.5         2015-06-26 Github (ramnathv/htmlwidgets@955ddc0)
jsonlite    * 0.9.16      2015-04-11 CRAN (R 3.2.0)
knitr         1.10        2015-04-23 CRAN (R 3.2.0)
lazyeval      0.1.10.9000 2015-06-07 Github (hadley/lazyeval@ecb8dc0)
magrittr    * 1.5         2014-11-22 CRAN (R 3.2.0)
plotrix     * 3.5-11      2015-01-21 CRAN (R 3.2.0)
R6          * 2.0.1       2014-10-29 CRAN (R 3.2.0)
Rcpp        * 0.11.6      2015-05-01 CRAN (R 3.2.0)
readr         0.1.0.9000  2015-06-08 Github (hadley/readr@9006822)
rJava       * 0.9-7       2015-07-19 local
rmarkdown   * 0.6.2.4     2015-06-07 Github (rstudio/rmarkdown@8c9e25b)
rstudioapi  * 0.3.1       2015-04-07 CRAN (R 3.2.0)
SnowballC     0.5.1       2014-08-09 CRAN (R 3.2.0)
stringi       0.4-1       2014-12-14 CRAN (R 3.2.0)
stringr       1.0.0       2015-04-30 CRAN (R 3.2.0)
XML         * 3.98-1.1    2013-06-20 CRAN (R 3.2.0)
xml2          0.1.0       2015-04-20 CRAN (R 3.2.0)
yaml        * 2.1.13      2014-06-12 CRAN (R 3.2.0)