前回までのrPubsページ
- 第1章:準備運動
- 第2章:UNIXコマンドの基礎
- 第3章:正規表現
- 第4章:形態素解析
- 第5章:構文解析
- 第6章:英語テキストの処理
- 第7章:データベース
- 第8章:機械学習
前回引き続き、言語処理100本ノック(2015年版)を解きます。
(下記の『前書き(言語処理100本ノックについて)』は前回と同じです)
前書き(言語処理100本ノックについて)
- 本稿では、東北大学の乾・岡崎研究室で公開されている言語処理100本ノック(2015年版)を、R言語で解いていきます。
- 改訂前の言語処理100本ノックも同様に上記研究室のサイトにあります。
前書き(Rに関して)
- Rの構文や関数についての説明は一切ありませんので、あらかじめご了承ください。
- 本稿では、{base}にある文字列処理ではなく、{stringr}(1.0.0以上)とパイプ処理を極力用いております({stringi}も処理に応じて活用していきます)。課題によってはパイプ処理でこなすのに向かない状況もありますので、あらかじめご了承ください。
- 今回は上記に加え、{pforeach}を用いて入力や集計を並列処理し、{Matrix}で疎行列を処理していきます。
参考ページ
{stringr}と{stringi}
hadley/stringr
RPubs - このパッケージがすごい2014: stringr
stringiで輝く☆テキストショリスト
stringr 1.0.0を使ってみる
{iterators}と{foreach}
Using The iterators Package
Writing Custom Iterators
Using The foreach Package
Looping and iterators, part I
foreachパッケージで並列化する時、現在の“環境”にない変数・関数は、明示的に.export引数にて指定しなければならない
forを捨てよ、foreachを書こう
{pforeach}
hoxo-m/pforeach
Rで超簡単に並列処理を書けるパッケージpforeach を作った
Rで超簡単に並列処理を書けるパッケージpforeach
{Matrix}
Introduction to the Matrix package
Sparse Model Matrices
疎な行列をRで扱う -R tips-
Rで疎な多次元配列を扱う(Matrix,slamパッケージ)
Rによる文書分類入門
ご意見やご指摘など
- こうした方が良いやこういう便利な関数がある、間違いがあるなど、ご指摘をお待ちしております。
- 下記のいずれかでご連絡・ご報告いただけますと励みになります(なお、Gitに慣れていない人です)。
Twitter, GitHub
# devtools::install_github("aaboyles/hadleyverse")
SET_LOAD_LIB <- c("knitr", "hadleyverse", "stringi", "lazyeval", "pforeach", "countrycode", "Matrix", "irlba", "Rcpp", "inline")
sapply(X = SET_LOAD_LIB, FUN = library, character.only = TRUE, logical.return = TRUE)
## knitr hadleyverse stringi lazyeval pforeach countrycode
## TRUE TRUE TRUE TRUE TRUE TRUE
## Matrix irlba Rcpp inline
## TRUE TRUE TRUE TRUE
load_packages <- as.character(na.omit(object = stringr::str_match(string = search(), pattern = "^package:(.*)")[, 2]))
knitr::opts_chunk$set(comment = NA)
# 第9章の入力データURL(固定)
# TASK_INPUT_URL <- "http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r10-105752.txt.bz2"
TASK_INPUT_URL <- "http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2"
# ファイル取得
download.file(
url = TASK_INPUT_URL, destfile = basename(TASK_INPUT_URL),
method = "wget", quiet = FALSE
)
if (!file.exists(file = basename(TASK_INPUT_URL))) {
stop("File not found.")
}
文を単語列に変換する最も単純な方法は,空白文字で単語に区切ることである. ただ,この方法では文末のピリオドや括弧などの記号が単語に含まれてしまう. そこで,コーパスの各行のテキストを空白文字でトークンのリストに分割した後,各トークンに以下の処理を施し,単語から記号を除去せよ.
- トークンの先頭と末尾に出現する次の文字を削除: .,!?;:()[]'"
- 空文字列となったトークンは削除
以上の処理を適用した後,トークンをスペースで連結してファイルに保存せよ.
tokenizeSentence <- function (
str, pattern_list
) {
return(
dplyr::data_frame(
text = stringr::str_replace_all(
string = stringr::str_split(string = str, pattern = pattern_list$SPLIT_SEP) %>%
dplyr::combine(),
pattern = stringr::str_c(
c(
stringr::str_c("^", pattern_list$DELETE_CHAR),
stringr::str_c(pattern_list$DELETE_CHAR, "$")
),
collapse = "|"
),
replacement = ""
)
) %>%
dplyr::filter(.$text != pattern_list$IS_NULL) %>%
dplyr::combine(.) %>%
unlist %>%
stringr::str_c(collapse = pattern_list$JOIN_SEP) %>%
dplyr::data_frame(text = .)
)
}
SET_PARALLE <- list(IS_PARALLEL = TRUE, CORE = 3, ITERATORS_CHUNK_SIZE = 5000)
SET_CHAR_PATTERN <- list(
SPLIT_SEP = "[:space:]",
DELETE_CHAR = c("\\.", "\\,", "\\!", "\\?", "\\;", "\\:", "\\(", "\\)", "\\[", "\\]", "\\'", "\""),
IS_NULL = "",
JOIN_SEP = " "
)
input_con <- file(description = basename(TASK_INPUT_URL), open = "r")
iter_file <- iterators::ireadLines(
con = input_con,
n = SET_PARALLE$ITERATORS_CHUNK_SIZE
)
formatted_text <- pforeach::pforeach(
read_df = iter_file,
.c = rbind,
.export = c("tokenizeSentence", "SET_CHAR_PATTERN"),
.packages = load_packages,
.parallel = SET_PARALLE$IS_PARALLEL, .cores = SET_PARALLE$CORE
)({
data.frame(text = read_df, stringsAsFactors = FALSE) %>%
dplyr::rowwise(.) %>%
dplyr::do(., tokenizeSentence(str = .$text, pattern_list = SET_CHAR_PATTERN))
}) %>%
print
Source: local data frame [252,694 x 1]
Groups: <by row>
text
1 Anarchism
2 Anarchism is a political philosophy that advocates stateless societies ofte
3 As a subtle and anti-dogmatic philosophy anarchism draws on many currents o
4 The central tendency of anarchism as a social movement has been represented
5 Etymology and terminology
6 The term anarchism is a compound word composed from the word anarchy and th
7 History
8 Origins
9 The earliest anarchist themes can be found in the 6th century BC among the
10 The French renaissance political philosopher Étienne de La Boétie wrote in
.. ...
close(input_con)
英語では,複数の語の連接が意味を成すことがある.例えば,アメリカ合衆国は“United States”,イギリスは“United Kingdom”と表現されるが,“United”や“States”,“Kingdom”という単語だけでは,指し示している概念・実体が曖昧である.そこで,コーパス中に含まれる複合語を認識し,複合語を1語として扱うことで,複合語の意味を推定したい.しかしながら,複合語を正確に認定するのは大変むずかしいので,ここでは複合語からなる国名を認定したい.
インターネット上から国名リストを各自で入手し,80のコーパス中に出現する複合語の国名に関して,スペースをアンダーバーに置換せよ.例えば,“United States”は“United_States”,“Isle of Man”は“Isle_of_Man”になるはずである.
# 国名リストは{countrycode}のデータを使う
coutry_name <- stringr::str_replace(
string = countrycode_data$country.name, pattern = "\\(.*\\)", replacement = ""
)
replace_coutry_name <- stringr::str_replace_all(
string = coutry_name, pattern = " ", replacement = "_"
)
names(replace_coutry_name) <- coutry_name
# 国名を含む文だけ処理する
is_include_coutry <- sapply(
X = lapply(X = formatted_text$text, FUN = stringr::str_detect, pattern = coutry_name),
FUN = any
)
formatted_text$text[is_include_coutry] <- stringr::str_replace_all(
str = formatted_text$text[is_include_coutry], pattern = replace_coutry_name
)
# ある程度の文字列長の国名を表示
formatted_text$text[is_include_coutry][
stringr::str_length(string = formatted_text$text[is_include_coutry]) < 15
]
[1] "India" "Germany" "Bavaria"
[4] "Japanese" "Ancient Greece" "Canada"
[7] "New_Zealand" "Pakistan" "India"
[10] "United_States" "Goa and India" "Japan"
[13] "Hungary" "Andorra" "Armenia"
[16] "Austria" "Azerbaijan" "Belarus"
[19] "Belgium" "Bulgaria" "Canada"
[22] "Croatia" "Czech_Republic" "Denmark"
[25] "Finland" "France" "Georgia"
[28] "Germany" "Greece" "Hungary"
[31] "Iceland" "Ireland" "Italy"
[34] "Kazakhstan" "Kyrgyzstan" "Spain"
[37] "Sweden" "United_Kingdom" "Canada"
[40] "United_States" "Brazil" "Chile"
[43] "France" "La Malinche" "New_Caledonia"
[46] "Canada" "United_Kingdom" "United_States"
[49] "United_Kingdom" "France" "United_States"
[52] "New_Zealand" "Germany" "Japanese"
[55] "Japanese" "Seychelles" "Antarctica"
[58] "Canada" "Costa_Rica" "France"
[61] "Indonesia" "Japan" "Malaysia"
[64] "Mexico" "New_Zealand" "Singapore"
[67] "United_Kingdom" "United_States" "Germany"
[70] "Sweden" "United_States" "Logic in India"
[73] "Logic in China" "United_States" "Nigeria"
[76] "Senegal" "Somalia" "Uganda"
[79] "China" "Hong_Kong" "Japan"
[82] "Korea" "India" "Philippines"
[85] "Morocco" "Israel" "Croatia"
[88] "Denmark" "Finland" "France"
[91] "Greece" "Hungary" "Italy"
[94] "Latvia" "Malta" "Netherlands"
[97] "Norway" "Poland" "Serbia"
[100] "Spain" "Sweden" "Switzerland"
[103] "Turkey" "Cuba" "Jamaica"
[106] "Mexico" "Panama" "Costa_Rica"
[109] "Nicaragua" "Guatemala" "Honduras"
[112] "Australia" "Fiji" "New_Zealand"
[115] "India" "Boston Indiana" "Peru Maine"
[118] "Germany" "Japan" "Peru Vermont"
[121] "Move to Italy" "Tour de France" "Reunion 1995"
[124] "Tour de France" "King of Jordan" "Namur Belgium"
[127] "Kosovo War" "Iraq" "Israel"
[130] "China" "Australia" "United_Kingdom"
[133] "United_States" "Palau Islands" "China"
[136] "Japan" "Korea" "Mongolia"
[139] "Singapore" "United_States" "Spain"
[142] "Greece" "India" "China"
[145] "Mexico" "New France" "United_States"
[148] "Brazil" "India" "Bangladesh"
[151] "Pakistan" "Nepal" "Sri_Lanka"
[154] "Afghanistan" "Turkey" "Azerbaijan"
[157] "Tajikistan" "Egypt" "Morocco"
[160] "Serbia" "Reunion" "Australia"
[163] "New_Zealand" "South_Africa" "China"
[166] "Bochum Germany" "Japanese tour" "South_Africa"
[169] "Hong_Kong" "Singapore" "United_States"
[172] "Canada" "United_Kingdom" "Germany"
[175] "India" "Armenia" "Latvia"
[178] "Lithuania" "Slovakia" "Spain"
[181] "Mexico" "Argentina" "Colombia"
[184] "Peru" "Germany" "United_Kingdom"
[187] "United_States" "France" "Italy"
[190] "China" "Japan" "War with Spain"
[193] "United_States" "United_States" "Canada"
[196] "Chile" "Honduras" "Mexico"
[199] "Brazil" "Austria" "France"
[202] "Ireland" "Italy" "Netherlands"
[205] "United_Kingdom" "Japan" "Israel"
[208] "Australia" "New_Zealand" "United_States"
[211] "Cyprus" "Egypt" "Hungary"
[214] "Morocco" "India" "United_States"
[217] "Japanese" "Avro Canada" "Mali"
[220] "Kenya" "Gabon" "Botswana"
[223] "Burkina_Faso" "Egypt" "Kosovo"
[226] "Montenegro" "Greece" "Canada"
[229] "France" "Germany" "Italy"
[232] "Malaysia" "New_Zealand" "Philippines"
[235] "Poland" "Portugal" "Spain"
[238] "Sri_Lanka" "Thailand" "Tunisia"
[241] "United_States" "Canada" "Cuba"
[244] "Haiti" "Canada" "Mexico"
[247] "United_States" "Canada" "Australia"
[250] "United_States" "United_Kingdom" "United_States"
[253] "United_States" "Jersey Numbers" "United_Kingdom"
[256] "India" "Ireland" "Puerto_Rico"
[259] "Northern India" "Southern India" "Western India"
[262] "Eastern India" "Spain" "China"
[265] "Iraq War" "United_Kingdom" "United_States"
[268] "HMS Fiji 58" "Korean War" "Lithuanians"
[271] "1920 in France" "Malta" "Korean War"
[274] "Australia" "Canada" "France"
[277] "Germany" "Japan" "Netherlands"
[280] "New_Zealand" "Rwanda" "Switzerland"
[283] "United_Kingdom" "United_States" "Ireland"
[286] "United_States" "Australia" "Denmark"
[289] "South China" "South China" "Ottoman Greece"
[292] "New_Zealand" "South_Africa" "France 3"
[295] "Korean War" "Indonesia" "Egypt"
[298] "Reunion" "United_States" "Canada"
[301] "Germany" "Monaco" "Zimbabwe"
[304] "United_States" "Canada" "Thailand"
[307] "United_Kingdom" "United_States" "United_Kingdom"
[310] "United_States" "Canada" "Mexico"
[313] "Sweden" "Ireland" "Italy"
[316] "Malta" "Mexico" "Australia"
[319] "Australia" "Belgium" "Canada"
[322] "Chile" "China" "Ecuador"
[325] "France" "Japan" "Netherlands"
[328] "New_Zealand" "Poland" "South_Africa"
[331] "Turkey" "United_Kingdom" "United_States"
[334] "Zimbabwe" "Kosovo–Resava" "United_States"
[337] "Canada" "In India" "Korean War"
[340] "South_Africa" "Argentina" "Australia"
[343] "Belgium" "Brazil" "Italy"
[346] "Israel" "New_Zealand" "Pakistan"
[349] "Portugal" "Philippines" "Qatar"
[352] "United_Kingdom" "United_States" "Thailand"
[355] "Turkey" "United_States" "United_Kingdom"
[358] "Japan" "Australia" "India"
[361] "Israel" "Malaysia" "North Korea"
[364] "United_Kingdom" "United_States" "Puerto_Rico"
[367] "United_Kingdom" "Netherlands" "France"
[370] "Reunion\"" "United_States" "United_Kingdom"
[373] "Japan only" "Japan only" "Japan"
[376] "Little Italy" "Australia" "Austria"
[379] "Hungary" "North India" "United_States"
[382] "United_Kingdom" "Belgium" "New_Zealand"
[385] "Ralph Jordan" "Armenian" "Indiana"
[388] "Duke of Parma" "Niger in 1974" "United_States"
[391] "Australia" "Doug Ireland" "Indonesia"
[394] "Philippines" "Kabri Israel" "Kylie Ireland"
[397] "Francesinha" "Ireland" "Romania"
[400] "Korean War" "United_States" "Poland"
[403] "Germany" "Turkey" "India"
[406] "Japan" "Australia" "Argentina"
[409] "South_Africa" "Canada" "Chile"
[412] "United_States" "Canada" "United_Kingdom"
[415] "Australia" "In Georgian" "In Belarusian"
[418] "In Croatian" "In Romanian" "In Estonian"
[421] "In Bulgarian" "Korean" "Japanese"
[424] "United_States" "Beer in Israel" "Belgium"
[427] "Australia" "Canada" "United_States"
[430] "United_Kingdom" "Finland" "Canada"
[433] "Greece" "Iceland" "Japan"
[436] "New_Zealand" "United_Kingdom" "United_States"
[439] "France" "Germany" "Japan"
[442] "United_Kingdom" "United_States" "Croatia"
[445] "Cyprus" "Czech_Republic" "France"
[448] "Germany" "Greece" "Hungary"
[451] "Ireland" "Italy" "Malta"
[454] "Poland" "Portugal" "Serbia"
[457] "Spain" "Turkey" "Ukraine"
[460] "United_Kingdom" "Canada" "United_States"
[463] "Mexico" "South_Africa" "Tunisia"
[466] "Fiji" "Australia" "New_Zealand"
[469] "Antarctica" "Zimbabwe" "Ireland"
[472] "Australian CD" "Air Jordans" "Israel"
[475] "ED250 Poland)" "Guatemala Clan" "Australia"
[478] "Canada" "Ireland" "Japan"
[481] "Malaysia" "New_Zealand" "Norway"
[484] "Slovakia" "South_Africa" "Spain"
[487] "United_Kingdom" "United_States" "United_States"
[490] "United_States" "In Albania" "Funke v France"
[493] "Georgia" "Indiana" "New Jersey"
[496] "New Mexico" "India" "Japan"
[499] "Lebanon" "Philippines" "Austria"
[502] "Czech_Republic" "Denmark" "Finland"
[505] "France" "Germany" "Iceland"
[508] "Ireland" "Italy" "Netherlands"
[511] "Norway" "Poland" "Portugal"
[514] "Romania" "Slovakia" "Slovenia"
[517] "Spain" "Sweden" "United_Kingdom"
[520] "Canada" "Honduras" "Mexico"
[523] "United_States" "Australia" "New_Zealand"
[526] "Brazil" "Peru" "Media of Sudan"
[529] "China" "Japan" "Korea"
[532] "Singapore" "United_States" "Australia"
[535] "Armenian" "Egypt" "Perugia"
[538] "Perugia" "Bank of Canada" "O Canada"
[541] "Guam" "Japan" "Korean War"
[544] "In Zimbabwe" "Zambia" "Spain"
[547] "Greenland" "Indian College" "Sweden"
[550] "In Israel" "By Israelis" "Indian Agent"
[553] "Paraguayan War" "France" "Mongolian name"
[556] "Mongolia" "Inner Mongolia" "Ukraine"
[559] "Estonia" "Czech_Republic" "In Korea"
[562] "Korean War" "Afghanistan" "Iraq"
[565] "Oia Greece" "Ilija Slovakia" "Cuban art"
[568] "Iceland" "Greenland" "Norway"
[571] "Sweden" "Denmark" "Korean War"
[574] "Morocco" "East Pakistan" "Emma of Italy"
[577] "Japanese ska" "Indian squad" "Italy"
[580] "New Guinea" "Korean War" "Arts of China"
[583] "Argentina" "Brazil" "Chile"
[586] "Colombia" "Peru" "SOS Israel"
[589] "Queen of Spain" "India 2007)" "India 2008)"
[592] "India 2014)" "United_Kingdom" "United_States"
[595] "Haiti" "United_States" "North Koreans"
[598] "South Koreans" "Sulm Austria" "South China"
[601] "• Albanian" "• Latvian" "• Romanian"
[604] "• Serbian" "Italy" "Montenegro"
[607] "Parma Panthers" "Malik Raja" "Australia"
[610] "Italy" "United_Kingdom" "Poland"
[613] "Uganda" "United_States" "Greece"
[616] "Yugoslavia" "Poland" "Czechoslovakia"
[619] "Bulgaria" "Romania" "Hungary"
[622] "Peru also is" "vs Fiji" "Bavarian Cup"
[625] "Time in Ghana" "France" "United_States"
[628] "Australia" "Australia" "Belize"
[631] "Canada" "France" "New_Zealand"
[634] "Philippines" "United_Kingdom" "United_States"
[637] "New Jersey" "Mexico U23" "United_States"
[640] "Mexico" "New_Zealand" "Colombia"
[643] "Viet_Nam War" "Tonga" "In Canada"
[646] "Aki Polandwood" "Australia" "Fiji"
[649] "Kiribati" "Nauru" "New_Zealand"
[652] "Palau" "Samoa" "Tonga"
[655] "Tuvalu" "Vanuatu" "and Japanese"
[658] "United_States" "Spain" "Mexico"
[661] "Canada" "Mexico" "China"
[664] "United_Kingdom" "Indonesia" "Thailand"
[667] "Japan" "Australia" "Australia"
[670] "United_Kingdom" "Canada" "Japan"
[673] "United_States" "Libya" "Ireland"
[676] "France" "Albania" "Czechoslovakia"
[679] "Hong_Kong" "Hungary" "Indonesia"
[682] "Japan" "Poland" "Romania"
[685] "Serbia" "South Korea" "United_States"
[688] "United_States" "After Bermuda" "Georgia"
[691] "New Jersey" "United_Kingdom" "United_States"
[694] "Switzerland" "Wild Turkey" "United_States"
[697] "American_Samoa" "Lower Austria" "Upper Austria"
[700] "United_States" "Mozambique" "Thailand"
[703] "Gulf of Mexico" "Australia" "Canada"
[706] "Greece" "India" "Israel"
[709] "Mexico" "New_Zealand" "South_Africa"
[712] "United_Kingdom" "United_States" "South_Africa"
[715] "East Indians" "Portugal" "Albania"
[718] "Austria" "Belgium" "Bulgaria"
[721] "Croatia" "Cyprus" "Czech_Republic"
[724] "Denmark" "Estonia" "Finland"
[727] "France" "Germany" "Greece"
[730] "Hungary" "Iceland" "Ireland"
[733] "Italy" "Kosovo" "Latvia"
[736] "Lithuania" "Luxembourg" "Malta"
[739] "Monaco" "Montenegro" "Netherlands"
[742] "Norway" "Poland" "Portugal"
[745] "Romania" "Serbia" "Slovenia"
[748] "Spain" "Sweden" "Switzerland"
[751] "Gibraltar" "Georgia State" "Liberia"
[754] "Reach Canada" "New Guinea" "Philippines"
[757] "Hong_Kong" "Philippines" "Trip to Japan"
[760] "France" "Germany" "Australia"
[763] "Canada" "Israel" "New_Zealand"
[766] "United_States" "Haiti 1802" "Brazil"
[769] "Reunion Gummer" "Korea" "Hong_Kong"
[772] "Australia" "Bahrain" "Canada"
[775] "Cyprus" "Jersey" "Mauritius"
[778] "South_Africa" "United_States" "Finland"
[781] "Ireland" "Israel" "Netherlands"
[784] "Tour de France" "Puerto_Rico" "Japan"
[787] "Brazil Gloob" "Portugal RTP2" "Origins Canada"
[790] "India" "Iraq" "Japan"
[793] "Libya" "United_States" "Israel"
[796] "North Korea" "Canada" "Australia"
[799] "Caen France" "GAC Ireland" "Jordan Obita"
[802] "Brad Jordan" "Taylah Jordan" "Israeli Druze"
[805] "In Israel" "Jordan Kovacs" "New Jersey"
[808] "Korean War" "Norway" "Iraq"
[811] "Greece" "Denmark" "Spain"
[814] "Yugoslavia" "Romania" "Sweden"
[817] "Belgium" "Luxembourg" "Parma"
[820] "Japan" "Brazil" "Canada"
[823] "Australia" "Ireland" "Hockey Canada"
[826] "Italy" "France" "Greece"
[829] "Georgia Force" "New Jersey" "South_Africa"
[832] "Korean War" "Japan" "United_States"
[835] "Pakistan" "India" "Libyan people"
[838] "Italy" "Japan" "India"
[841] "In Australia" "In New_Zealand" "Canada"
[844] "Croatia" "France" "India"
[847] "Italy" "Morocco" "Norway"
[850] "Peru" "Portugal" "Spain"
[853] "Turkey" "United_States" "Frances Hugle"
[856] "Korean War" "Canada" "China"
[859] "Germany" "India" "Israel"
[862] "Japan" "Norway" "b Ghana’s Gold"
[865] "Israel" "France" "United_Kingdom"
[868] "Liechtenstein" "Spain" "United_States"
[871] "Israel" "Peru" "Burundi"
[874] "Haiti" "Mexico" "Thailand"
[877] "Korean War" "Malin Diaz" "Germany - Puma"
[880] "Turkey - Altay" "Korean War" "Indian IS 1293"
[883] "Tunisia" "Oman" "Italy"
[886] "Japanese" "In Estonia" "In Latvia"
[889] "Visit to Italy" "East Germany" "Cuban Thaw"
81で作成したコーパス中に出現するすべての単語tに関して,単語tと文脈語cのペアをタブ区切り形式ですべて書き出せ.ただし,文脈語の定義は次の通りとする.
- ある単語tの前後d単語を文脈語cとして抽出する(ただし,文脈語に単語tそのものは含まない)
- 単語tを選ぶ度に,文脈幅dは{1,2,3,4,5}の範囲でランダムに決める.
# word, type(前 or 後), window, context_word
fetchContextWord <- function (
str, window_size, seed = NULL
){
str_len <- length(str)
if (is.null(seed)){
set.seed(seed = seed)
}
context_window <- sample.int(n = window_size, size = str_len, replace = TRUE)
word_window <- seq(from = 1, to = str_len) + context_window
word_window[word_window > str_len] <- str_len
# 単語の後側d語
right_word <- stringr::str_split_fixed(
string = stringr::word(
string = stringi::stri_flatten(str = str, collapse = " "),
start = seq(from = 1, to = str_len), end = word_window,
sep = " "
),
pattern = " ",
n = context_window + 1
)
if (ncol(right_word) < window_size + 1) {
right_word <- cbind(
right_word,
matrix(data = "", nrow = nrow(right_word), ncol = (window_size + 1) - ncol(right_word))
)
}
# 単語の前側d語
left_word <- stringr::str_split_fixed(
string = stringr::word(
string = stringi::stri_flatten(str = str[seq(from = str_len, to = 1)], collapse = " "),
start = seq(from = 1, to = str_len), end = word_window,
sep = " "
),
pattern = " ",
n = context_window + 1
)[seq(from = str_len, to = 1), , drop = FALSE]
if (ncol(left_word) < window_size + 1) {
left_word <- cbind(
left_word,
matrix(data = "", nrow = nrow(left_word), ncol = (window_size + 1) - ncol(left_word))
)
}
right_window <- left_window <- seq(from = 1, to = window_size + 1)
names(x = right_window) <-names(x = left_window) <- c("word", seq(from = 1, to = window_size))
return(
dplyr::bind_rows(
data.frame(right_word, stringsAsFactors = FALSE) %>%
dplyr::rename_(
.dots = setNames(
object = stringr::str_c("X", right_window),
nm = names(right_window)
)
) %>%
dplyr::mutate(type = "right"),
data.frame(left_word, stringsAsFactors = FALSE) %>%
dplyr::rename_(
.dots = setNames(
object = stringr::str_c("X", left_window),
nm = names(left_window)
)
) %>%
dplyr::mutate(type = "left")
) %>%
tidyr::gather_(
key_col = c("window"),
gather_cols = as.character(seq(from = 1, to = window_size)),
value_col = c("context_word")
)
)
}
SET_SEED <- 71
SET_WINDOW_SIZE <- 5
SET_PARALLE_SPLIT_SIZE <- 5000
SET_WRITE_FILE_NAME <- "term_context.tsv"
# 今回は文の順番は関係ないタスクなので、乱数を用いて文を分割して処理
formatted_text$split <- sort(
x = sample.int(
n = SET_PARALLE_SPLIT_SIZE, size = nrow(formatted_text),
replace = TRUE
)
)
formatted_text_iter <- iterators::isplit(x = formatted_text$text, f = formatted_text$split)
word_context_word <- pforeach::pforeach(
p_read_df = formatted_text_iter,
.c = rbind,
.export = c("fetchContextWord", "SET_WINDOW_SIZE", "SET_SEED", "SET_WRITE_FILE_NAME"),
.packages = load_packages,
.parallel = SET_PARALLE$IS_PARALLEL, .cores = SET_PARALLE$CORE,
.inorder = FALSE
)({
dplyr::data_frame(text = p_read_df$value) %>%
dplyr::rowwise(.) %>%
dplyr::do(.,
fetchContextWord(
str = stringr::str_split(string = .$text, pattern = "[:blank:]") %>%
unlist,
window_size = SET_WINDOW_SIZE,
seed = SET_SEED
)
) %>%
dplyr::filter(word != "") %>%
dplyr::filter(context_word != "") %>%
dplyr::select(word, context_word) %>%
readr::write_tsv(path = SET_WRITE_FILE_NAME, append = TRUE)
})
word_context_word <- readr::read_tsv(
file = SET_WRITE_FILE_NAME, n_max = -1,
col_names = c("word", "context_word")
) %>%
print
|================================================================================| 100% 782 MB
Warning: 719 problems parsing 'term_context.tsv'. See problems(...) for
more details.
Source: local data frame [68,078,486 x 2]
word context_word
1 Anarchism is
2 is a
3 a political
4 political philosophy
5 philosophy that
6 that advocates
7 advocates stateless
8 stateless societies
9 societies often
10 often defined
.. ... ...
readr::problems(x = word_context_word)
Source: local data frame [719 x 4]
row col expected actual
1 1169592 3 Only 2 columns
2 1171327 3 Only 2 columns
3 1175827 3 Only 2 columns
4 1178457 3 Only 2 columns
5 1180217 3 Only 2 columns
6 1182196 3 Only 2 columns
7 1183243 3 Only 2 columns
8 1185986 3 Only 2 columns
9 1186978 3 Only 2 columns
10 1188287 3 Only 2 columns
.. ... ... ... ...
word_context_word <- na.omit(word_context_word)
82の出力を利用し,以下の出現分布,および定数を求めよ.
- f(t,c): 単語tと文脈語cの共起回数
- f(t,∗): 単語tの出現回数
- f(∗,c): 文脈語cの出現回数
- N: 単語と文脈語のペアの総出現回数
# f(t,c)
co_word_pair <- word_context_word %>%
dplyr::group_by(word, context_word) %>%
dplyr::summarize(tc = n()) %>%
print
Source: local data frame [21,748,822 x 3]
Groups: word
word
1 \this\nform\tmaterial\nto\tform\nfeed\tto\nfor\tfeed\nthe\tfor\nlast\tthe\ntime\tlast\nfrom
2 !
3 !
4 !
5 !
6 !
7 !
8 !
9 !
10 !
.. ...
Variables not shown: context_word (chr), tc (int)
# f(t,∗)
word_freq <- word_context_word %>%
dplyr::ungroup() %>%
dplyr::group_by(word) %>%
dplyr::summarize(t = n()) %>%
print
Source: local data frame [432,329 x 2]
word
1 \this\nform\tmaterial\nto\tform\nfeed\tto\nfor\tfeed\nthe\tfor\nlast\tthe\ntime\tlast\nfrom
2 !
3 !*@
4 !,
5 "
6 ""
7 "")
8 ""Abid"
9 ""Allerheiligen-Wasserfälle"")
10 ""Banque
.. ...
Variables not shown: t (int)
# f(∗,c)
context_freq <- word_context_word %>%
dplyr::ungroup() %>%
dplyr::group_by(context_word) %>%
dplyr::summarize(c = n()) %>%
print
Source: local data frame [432,863 x 2]
context_word c
1 ! 25
2 !*@ 5
3 !, 6
4 " 1890
5 "" 57
6 "") 34
7 ""Abid" 7
8 ""Allerheiligen-Wasserfälle"") 4
9 ""Banque 8
10 ""Dampf"") 7
.. ... ...
# N
nrow(co_word_pair)
[1] 21748822
83の出力を利用し,単語文脈行列Xを作成せよ.ただし,行列Xの各要素Xtcは次のように定義する.
- f(t,c) ≥ 10ならば,\(X_{tc}\)=PPMI(t,c)=\(max \{log\frac{N \times f(t,c)}{f(t,∗) \times f(∗,c)},0 \}\)
- f(t,c) < 10ならば,\(X_{tc}\)=0
ここで,PPMI(t,c)はPositive Pointwise Mutual Information(正の相互情報量)と呼ばれる統計量である.なお,行列Xの行数・列数は数百万オーダとなり,行列のすべての要素を主記憶上に載せることは無理なので注意すること.幸い,行列Xのほとんどの要素は0になるので,非0の要素だけを書き出せばよい.
SET_THRESHOLD <- 10
ppmi <- suppressWarnings(
dplyr::left_join(
x = dplyr::left_join(
x = co_word_pair %>%
dplyr::filter(tc >= SET_THRESHOLD),
y = word_freq,
by = "word"
),
y = context_freq,
by = "context_word"
) %>%
dplyr::mutate(ppmi = log((nrow(co_word_pair) * tc) / (t * c)))
)
# 桁あふれによるNaN, NA化した値をゼロに変換
ppmi$ppmi <- ifelse(ppmi$ppmi > 0, ppmi$ppmi, 0)
ppmi <- ppmi %>%
dplyr::select(word, context_word, ppmi) %>%
replace(is.na(.), 0)
# 疎行列化(単純にtidyr::spreadをすると、メモリ消費が大きい)
occur_words <- unique(word_freq$word)
ppmi$word <- factor(ppmi$word, levels = as.character(occur_words))
ppmi$context_word <- factor(ppmi$context_word, levels = as.character(occur_words))
word_context_ppmi <- Matrix::sparseMatrix(
i = as.integer(ppmi$word), j = as.integer(ppmi$context_word),
x = ppmi$ppmi,
dims = c(length(occur_words), length(occur_words)),
dimnames = list(occur_words, occur_words)
)
84で得られた単語文脈行列に対して,主成分分析を適用し,単語の意味ベクトルを300次元に圧縮せよ.
SET_DIM_NUM <- 300
# 疎行列に対応している特異値分解(SVD)の関数を使う
# ここではさらに左・右の特異ベクトルを近似した手法を用いる
# (理解が足らないので「所感」で触れた資料や書籍で勉強し直す)
lsi_res <- irlba::irlba(
A = word_context_ppmi,
nu = SET_DIM_NUM, nv = SET_DIM_NUM
)
word_sence <- t(
t(lsi_res$u[, seq(from = 1, to = SET_DIM_NUM), drop = FALSE]) %*% word_context_ppmi
)
rownames(word_sence) <- rownames(word_context_ppmi)
85で得た単語の意味ベクトルを読み込み,“United States”のベクトルを表示せよ.ただし,“United States”は内部的には“United_States”と表現されていることに注意せよ.
SET_SEARCH_WORD <- "United_States"
# 計算しておいた意味ベクトルから行名でマッチング
word_sence[is.element(rownames(word_sence), SET_SEARCH_WORD), ]
[1] 2.037838e+00 -9.375344e-01 -3.106189e-01 -1.605345e-01 -5.904840e-02
[6] 6.518050e-01 2.602280e-02 5.277793e-02 -4.068961e-01 2.610499e-01
[11] 3.918249e-01 -4.765525e-02 6.441821e-02 3.308307e-01 -7.000656e-01
[16] -1.136182e+00 -7.595355e-01 1.312867e+00 2.849231e-01 -5.978687e-01
[21] -2.377491e-01 2.204588e-01 -4.045608e-01 7.395452e-01 1.753993e-01
[26] -5.848825e-01 7.487949e-02 -8.271620e-02 3.057006e-01 -2.858766e-01
[31] -1.626025e-14 -7.482628e-02 5.950323e-01 4.345851e-01 1.774483e-01
[36] 3.518586e-01 -1.127549e-01 1.978918e-02 -2.124593e-02 2.632636e-01
[41] 1.988843e-03 -1.085607e-01 -5.579423e-02 -6.353527e-01 -8.574127e-02
[46] -2.702903e-01 -1.086915e-01 -9.561545e-02 3.506378e-01 -1.136618e-01
[51] -4.273408e-02 2.581737e-01 4.037139e-01 -5.753142e-02 -4.511197e-02
[56] 3.837357e-01 -3.667011e-01 -5.704303e-01 4.766412e-01 -5.312001e-02
[61] 5.940559e-01 -9.484519e-01 8.184152e-01 -2.463707e-01 4.159756e-01
[66] 8.324350e-01 1.780600e-01 -2.443294e-01 -2.646356e-02 1.453572e-01
[71] 2.839435e-01 2.313725e-01 6.264211e-01 -3.446654e-01 6.386369e-01
[76] 3.006067e-02 -3.472549e-01 5.245093e-01 6.549072e-01 2.267178e-01
[81] -1.744061e-01 1.544763e-01 -1.885759e-01 3.261650e-01 1.818196e-01
[86] -5.395743e-01 5.581458e-01 -5.187504e-01 1.807359e-01 -3.550431e-01
[91] -1.276461e-02 -8.716716e-02 -1.975487e-01 1.522283e-01 3.874773e-01
[96] 1.084189e-02 1.789052e-01 -3.046082e-02 4.413375e-02 8.527400e-02
[101] 1.860033e-01 -1.483027e-02 8.239071e-01 2.443411e-01 -1.125405e-01
[106] -7.860227e-02 2.509777e-01 4.514155e-03 -4.874264e-01 4.044995e-02
[111] 2.337708e-01 -4.252291e-01 -8.265042e-01 -6.330464e-01 -1.257250e-01
[116] -1.402304e-01 5.298778e-01 8.078769e-02 1.996961e-01 -1.518596e-01
[121] 4.562999e-01 -5.120750e-01 1.661834e-01 -3.112895e-02 1.371840e-01
[126] 2.215744e-01 -4.915658e-01 -5.562550e-02 4.136874e-01 2.518635e-01
[131] -1.942698e-01 4.185125e-01 2.838158e-01 -1.023195e+00 2.331462e-01
[136] 3.405410e-01 2.619637e-01 1.839301e-01 4.260956e-03 -7.053745e-02
[141] -3.222604e-02 1.703391e-01 -1.908760e-01 -7.297733e-01 3.924142e-01
[146] 2.141091e-01 -8.826797e-02 -1.787676e-12 -3.463508e-01 -2.527450e-01
[151] 1.264967e-01 2.092116e-01 -1.912777e-01 2.862594e-01 -1.797915e-01
[156] 3.921730e-03 -1.882014e-01 3.413298e-01 7.014191e-02 -4.041935e-02
[161] 3.021720e-01 -5.929334e-02 -1.431572e-01 -4.116285e-01 2.270619e-01
[166] -1.322969e-01 -4.940011e-01 -3.803055e-01 -1.353691e-01 -5.702957e-01
[171] -1.425753e-01 2.804969e-01 9.874140e-02 -5.764361e-01 -1.951386e-01
[176] 3.810475e-01 -1.868904e-01 3.822573e-01 -6.042654e-02 -2.721631e-01
[181] 3.634367e-01 2.552693e-01 -2.499335e-02 1.852346e-01 6.149274e-01
[186] -2.711030e-01 -1.853175e-01 -1.172159e-01 -3.391866e-01 -3.626279e-02
[191] 5.045489e-01 -2.980203e-01 2.137921e-01 -4.796995e-01 2.405535e-01
[196] -9.349896e-01 5.770628e-02 -4.376850e-01 -3.033761e-01 -3.721689e-01
[201] -4.449024e-01 -4.249434e-01 -2.055324e-02 2.367674e-01 5.078755e-01
[206] 2.513147e-01 1.643741e-01 -1.901829e-01 5.284053e-03 -2.717752e-01
[211] -1.348662e-01 1.102604e-01 -1.819928e-01 3.032791e-01 8.319814e-02
[216] -1.158638e-01 -2.505669e-01 -2.838608e-03 -1.407063e-01 4.587667e-01
[221] 4.361180e-01 -3.722908e-01 -6.295600e-02 -4.343454e-02 -6.277148e-02
[226] -1.120107e-01 -1.999548e-01 -2.116093e-01 -7.791790e-01 -5.385965e-02
[231] -2.015585e-01 -2.973695e-01 -9.012819e-02 7.317029e-02 5.602437e-01
[236] -3.891752e-02 -7.414676e-01 3.921536e-02 2.230800e-01 -5.484260e-01
[241] 9.380090e-02 3.474563e-01 4.400047e-03 -8.169646e-01 -6.850005e-02
[246] -1.643937e-01 -2.460247e-12 -4.393261e-02 -5.913047e-01 -4.294057e-02
[251] -2.508756e-02 5.760611e-02 -2.234026e-01 -3.073265e-02 -3.955376e-02
[256] -1.963136e-01 -3.460861e-01 8.795343e-02 4.492767e-01 3.553542e-01
[261] -7.870723e-01 -4.611821e-01 1.242627e-01 1.111002e-02 4.436061e-01
[266] 5.281246e-01 2.755074e-01 -4.890112e-01 1.931703e-01 -2.591270e-02
[271] -2.714101e-01 -3.758363e-01 5.954379e-01 1.409803e-01 6.588941e-01
[276] 1.283339e-01 1.405396e-01 -2.381568e-01 -3.367582e-01 -1.644492e-01
[281] 4.049569e-02 2.608836e-01 -3.598677e-01 8.669979e-01 -1.865976e-01
[286] 3.423840e-01 -1.029759e-01 2.729978e-01 3.205047e-01 -2.795269e-01
[291] -1.308581e-01 6.806350e-02 1.276574e-01 9.465147e-02 -1.063264e-01
[296] 8.932431e-02 -2.399941e-01 3.492260e-01 -3.412677e-01 -3.299891e-01
85で得た単語の意味ベクトルを読み込み,“United States”と“U.S.”のコサイン類似度を計算せよ.ただし,“U.S.”は内部的に“U.S”と表現されていることに注意せよ.
# コサイン類似度を{Rcpp}で書く
# 今回は使わない
rcpp_cosine_sim <- '
NumericVector calcCosineSim(NumericMatrix ipt_mat) {
using namespace Rcpp;
Environment base("package:base");
Function Rcrossprod = base["crossprod"];
Function Rtranspose = base["t"];
Function Router = base["outer"];
NumericMatrix numerator = Rcrossprod(Rtranspose(ipt_mat));
NumericVector denominator = diag(numerator);
return Rcpp::wrap(numerator / sqrt(Router(denominator, denominator)));
}
'
calcCosine <- Rcpp::cppFunction(code = rcpp_cosine_sim, plugins = "cpp11")
# コサイン類似度の必要な行だけを取り出す(N * Nの行列の出力を避ける)
filterCosineSim <- function (
seed_word_vector, target_word_vectors,
extract_rownames = NULL
) {
word_vectors <- rbind(seed_word_vector, target_word_vectors)
numerator <- crossprod(x = t(x = word_vectors))
denominator <- diag(numerator)
return((numerator / sqrt(outer(denominator, denominator)))[extract_rownames, ])
}
SET_COMPARE_WORDS <- c("United_States", "^U\\.S$")
filterCosineSim(
seed_word_vector = word_sence[
is.element(rownames(word_sence), SET_COMPARE_WORDS[1]), , drop = FALSE
],
target_word_vectors = word_sence[
stringr::str_detect(
string = rownames(word_sence), pattern = SET_COMPARE_WORDS[-1]
), , drop = FALSE
],
extract_rownames = SET_COMPARE_WORDS[1]
)
United_States U.S
1.0000000 0.7838752
85で得た単語の意味ベクトルを読み込み,“England”とコサイン類似度が高い10語と,その類似度を出力せよ.
# 適当な乱数を割り当てて、一度にコサイン類似度を求めるベクトル数を減らす
fetchCosineSimilarity <- function(
seed_word_vector, target_words_sence,
seed_word_name,
split_size
){
# 疎行列から行列へ変換
seed_word_vector <- t(apply(X = seed_word_vector, MARGIN = 1, FUN = as.matrix))
target_words_sence <- data.frame(
t(apply(X = target_words_sence, MARGIN = 1, FUN = as.matrix)),
stringsAsFactors = FALSE
)
target_words_sence$split <- sample.int(
n = split_size, size = nrow(target_words_sence),
replace = TRUE
)
fetch_cs <- lapply(
X = split(
x = target_words_sence[, !is.element(colnames(target_words_sence), "split")],
f = target_words_sence$split
),
FUN = function (target_sence) {
cosine_sim_res <- filterCosineSim(
seed_word_vector = seed_word_vector,
target_word_vectors = as.matrix(target_sence),
extract_rownames = seed_word_name
)[-1]
return(cosine_sim_res[!is.nan(x = cosine_sim_res)])
}
)
cs_names <- dplyr::combine(sapply(X = fetch_cs, FUN = names))
fetch_cs <- dplyr::combine(fetch_cs)
names(fetch_cs) <- cs_names
return(fetch_cs)
}
TASK_EXTRACT_NUM <- 10
SET_SEED_WORD <- c("England")
SET_SPLIT_SIZE <- as.integer(nrow(word_sence) / 5000)
# 並列処理しなくても充分だったのでしていない
fetch_cs <- fetchCosineSimilarity(
seed_word_vector = word_sence[is.element(rownames(word_sence), SET_SEED_WORD), , drop = FALSE],
target_words_sence = word_sence[!is.element(rownames(word_sence), SET_SEED_WORD), , drop = FALSE],
seed_word_name = SET_SEED_WORD,
split_size = SET_SPLIT_SIZE
)
# 上位10語を表示
sort(fetch_cs, decreasing = TRUE)[seq(from = 1, to = TASK_EXTRACT_NUM)]
Cheshire Scotland Wildside Italy NWA
0.6423472 0.5718693 0.5519467 0.4917252 0.4814824
United_Kingdom Japan Germany Wales Ireland
0.4802970 0.4760204 0.4742704 0.4617970 0.4571639
85で得た単語の意味ベクトルを読み込み,vec(“Spain”) - vec(“Madrid”) + vec(“Athens”)を計算し,そのベクトルと類似度の高い10語とその類似度を出力せよ.
# 意味ベクトルを演算
# 「+」と「-」のみの演算子に対応(他の演算子はスルーするので注意)
createArithmeticWordVector <- function (
word_sence, def_arithmetic
) {
return(
colSums(
do.call(
what = "rbind",
args = lapply(
X = names(def_arithmetic),
FUN = function (each_arithmetic) {
return(
switch(EXPR = as.character(def_arithmetic[each_arithmetic]),
"+" = + word_sence[each_arithmetic, ],
"-" = - word_sence[each_arithmetic, ]
)
)
}
)
)
)
)
}
TASK_EXTRACT_NUM <- 10
SET_DEF_ARITHMETIC <- c("Spain" = "+", "Madrid" = "-", "Athens" = "+")
SET_SPLIT_SIZE <- as.integer(nrow(word_sence) / 5000)
# 演算対象から仮の名前を生成して、それで計算結果を絞り込み
create_arithmtic_word_name <- stringr::str_c(names(SET_DEF_ARITHMETIC), collapse = "_")
fetch_arithmetic_cs <- fetchCosineSimilarity(
seed_word_vector = matrix(
data = createArithmeticWordVector(
word_sence = word_sence, def_arithmetic = SET_DEF_ARITHMETIC
),
nrow = 1, ncol = ncol(word_sence),
dimnames = list(create_arithmtic_word_name, NULL)
),
target_words_sence = word_sence,
seed_word_name = create_arithmtic_word_name,
split_size = SET_SPLIT_SIZE
)
# 上位10語を表示
sort(fetch_arithmetic_cs, decreasing = TRUE)[seq(from = 1, to = TASK_EXTRACT_NUM)]
Spain Austria Netherlands Belgium Brussels
0.9176357 0.9098281 0.9011710 0.8996146 0.8956092
Antwerp Télévisions Turkey France Hungary
0.8954350 0.8952880 0.8932235 0.8909760 0.8909016
言語処理100本ノック(2015年版)のベクトル空間法(I)の章をやってみました(データ量を減らして逃げてしまいました)。
{Matrix}の疎行列のクラスには他にもdgRMatrixやdgTMatrixなどがありますが、どういうデータにはどういうクラスが向いていて、どういう風に変換しているかは今後調査します(上級ハンドブックに記載あり)。
showClass("sparseMatrix")
Virtual Class "sparseMatrix" [package "Matrix"]
Slots:
Name: Dim Dimnames
Class: integer list
Extends:
Class "Matrix", directly
Class "mMatrix", by class "Matrix", distance 2
Known Subclasses:
Class "diagonalMatrix", directly
Class "TsparseMatrix", directly
Class "CsparseMatrix", directly
Class "RsparseMatrix", directly
Class "dsparseMatrix", directly
Class "lsparseMatrix", directly
Class "nsparseMatrix", directly
Class "indMatrix", directly
Class "ddiMatrix", by class "diagonalMatrix", distance 2
Class "ldiMatrix", by class "diagonalMatrix", distance 2
Class "dgTMatrix", by class "TsparseMatrix", distance 2
Class "dtTMatrix", by class "TsparseMatrix", distance 2
Class "dsTMatrix", by class "TsparseMatrix", distance 2
Class "lgTMatrix", by class "TsparseMatrix", distance 2
Class "ltTMatrix", by class "TsparseMatrix", distance 2
Class "lsTMatrix", by class "TsparseMatrix", distance 2
Class "ngTMatrix", by class "TsparseMatrix", distance 2
Class "ntTMatrix", by class "TsparseMatrix", distance 2
Class "nsTMatrix", by class "TsparseMatrix", distance 2
Class "dgCMatrix", by class "CsparseMatrix", distance 2
Class "dtCMatrix", by class "CsparseMatrix", distance 2
Class "dsCMatrix", by class "CsparseMatrix", distance 2
Class "lgCMatrix", by class "CsparseMatrix", distance 2
Class "ltCMatrix", by class "CsparseMatrix", distance 2
Class "lsCMatrix", by class "CsparseMatrix", distance 2
Class "ngCMatrix", by class "CsparseMatrix", distance 2
Class "ntCMatrix", by class "CsparseMatrix", distance 2
Class "nsCMatrix", by class "CsparseMatrix", distance 2
Class "dgRMatrix", by class "RsparseMatrix", distance 2
Class "dtRMatrix", by class "RsparseMatrix", distance 2
Class "dsRMatrix", by class "RsparseMatrix", distance 2
Class "lgRMatrix", by class "RsparseMatrix", distance 2
Class "ltRMatrix", by class "RsparseMatrix", distance 2
Class "lsRMatrix", by class "RsparseMatrix", distance 2
Class "ngRMatrix", by class "RsparseMatrix", distance 2
Class "ntRMatrix", by class "RsparseMatrix", distance 2
Class "nsRMatrix", by class "RsparseMatrix", distance 2
Class "dgTMatrix", by class "dsparseMatrix", distance 2
Class "dtTMatrix", by class "dsparseMatrix", distance 2
Class "dsTMatrix", by class "dsparseMatrix", distance 2
Class "dgCMatrix", by class "dsparseMatrix", distance 2
Class "dtCMatrix", by class "dsparseMatrix", distance 2
Class "dsCMatrix", by class "dsparseMatrix", distance 2
Class "dgRMatrix", by class "dsparseMatrix", distance 2
Class "dtRMatrix", by class "dsparseMatrix", distance 2
Class "dsRMatrix", by class "dsparseMatrix", distance 2
Class "lgTMatrix", by class "lsparseMatrix", distance 2
Class "ltTMatrix", by class "lsparseMatrix", distance 2
Class "lsTMatrix", by class "lsparseMatrix", distance 2
Class "lgCMatrix", by class "lsparseMatrix", distance 2
Class "ltCMatrix", by class "lsparseMatrix", distance 2
Class "lsCMatrix", by class "lsparseMatrix", distance 2
Class "lgRMatrix", by class "lsparseMatrix", distance 2
Class "ltRMatrix", by class "lsparseMatrix", distance 2
Class "lsRMatrix", by class "lsparseMatrix", distance 2
Class "ngTMatrix", by class "nsparseMatrix", distance 2
Class "ntTMatrix", by class "nsparseMatrix", distance 2
Class "nsTMatrix", by class "nsparseMatrix", distance 2
Class "ngCMatrix", by class "nsparseMatrix", distance 2
Class "ntCMatrix", by class "nsparseMatrix", distance 2
Class "nsCMatrix", by class "nsparseMatrix", distance 2
Class "ngRMatrix", by class "nsparseMatrix", distance 2
Class "ntRMatrix", by class "nsparseMatrix", distance 2
Class "nsRMatrix", by class "nsparseMatrix", distance 2
Class "pMatrix", by class "indMatrix", distance 2
言語処理で扱うデータは大規模なスパースになる傾向があり、統計モデリングや機械学習するには逐次処理させるか、データを疎行列化(モデル式で疎行列化できるMatrix::sparse.model.matrixが便利かも)して、疎行列に対応した関数を適用する必要があります。{glmnet}や{xgboost}, {e1071}などに疎行列に対応した関数が定義されています(他にも{MatrixModels}や{sparsereg}も対応している模様(こちらは調査中))。
Sparse Model Matrices for Generalized Linear Models
The Sparse Matrix and {glmnet}
MatrixModels: Modelling with Sparse And Dense Matrices
sparsereg: Sparse Bayesian Models for Regression, Subgroup Analysis, and Panel Data
– また、前回の課題で使った{FeatureHashing}のhashed.model.matrix関数はdgcMatrix引数をTRUEにすると、疎行列として結果を返します(前回の課題72を参照のこと)。
主成分分析の課題では特異値分解した結果を使いました。この際、implicitly restarted Lanczos bidiagonalizationを実装した{irlba}を用いましたが、他にも乱拓法による近似もあります(よく理解していないので、下記に参照したリンクを記述)。
Augmented Implicitly Restarted Lanczos Bidiagonalization Methods
Partial Lanczos SVD methods for R
– 乱拓法
Fast Randomized SVD
SVD for sparse matrix in R
RedSVD
RcppEigen and SVD
– その他メモ
大規模データ処理のための行列の低ランク近似 – SVD から用例ベースの行列分解まで –
18 Matrix decomposition and latent semantic indexing (pp.369-384)
主成分分析と特異値分解
潜在的意味インデキシング(LSI)徹底入門
今年のSIGKDDベストペーパーを実装・公開してみました
# RedSVDをHomebrewからインストールして、RからRedSVDを使おうとした際のメモ(not run)
# https://github.com/ntessore/homebrew-nt
brew tap ntessore/homebrew-nt
brew install redsvd
# テストで失敗する
# devtools::install_github("xiangze/RRedsvd")
本課題をこなすにあたり、下記の書籍で学んでおくといいと思います(自分復習用)。
Rによるハイパフォーマンスコンピューティング
プログラミングのための線形代数
これなら分かる応用数学教室―最小二乗法からウェーブレットまで
個人的には、言語処理の課題よりも大規模データの処理を考える方が難しかったです。
library(devtools)
devtools::session_info()
Session info --------------------------------------------------------------
setting value
version R version 3.2.1 (2015-06-18)
system x86_64, darwin13.4.0
ui X11
language (EN)
collate ja_JP.UTF-8
tz Asia/Tokyo
Packages ------------------------------------------------------------------
package * version date
assertthat * 0.1 2013-12-06
codetools 0.2-11 2015-03-10
colorspace 1.2-6 2015-03-11
countrycode * 0.18 2014-12-29
crayon 1.3.0 2015-06-05
curl 0.9 2015-06-19
DBI 0.3.1 2014-09-24
devtools * 1.8.0 2015-05-09
digest 0.6.8 2014-12-31
doParallel 1.0.8 2014-02-28
doRNG 1.6 2014-03-07
dplyr * 0.4.2.9002 2015-07-25
evaluate 0.7 2015-04-21
foreach 1.4.2 2014-04-11
formatR 1.2 2015-04-21
ggplot2 * 1.0.1 2015-03-17
git2r 0.10.1 2015-05-07
gtable 0.1.2 2012-12-05
hadleyverse * 0.1 2015-08-09
haven * 0.2.0 2015-04-09
htmltools 0.2.6 2014-09-08
inline * 0.3.14 2015-04-13
irlba * 1.0.3 2014-01-25
iterators 1.0.7 2014-04-11
knitr * 1.10.5 2015-05-06
lattice 0.20-31 2015-03-30
lazyeval * 0.1.10.9000 2015-07-25
lubridate * 1.3.3 2013-12-31
magrittr 1.5 2014-11-22
MASS 7.3-41 2015-06-18
Matrix * 1.2-1 2015-06-01
memoise 0.2.1 2014-04-22
munsell 0.4.2 2013-07-11
pforeach * 1.3 2015-07-25
pkgmaker 0.22 2014-05-14
plyr * 1.8.3 2015-06-12
proto 0.3-10 2012-12-22
R6 2.0.1 2014-10-29
Rcpp * 0.12.0 2015-07-26
readr * 0.1.1.9000 2015-07-25
readxl * 0.1.0 2015-04-14
registry 0.2 2012-01-24
reshape2 1.4.1 2014-12-06
rmarkdown 0.7 2015-06-13
rngtools 1.2.4 2014-03-06
rversions 1.0.1 2015-06-06
scales 0.2.5 2015-06-12
stringi * 0.5-5 2015-06-29
stringr * 1.0.0.9000 2015-07-25
testthat * 0.10.0 2015-05-22
tidyr * 0.2.0.9000 2015-07-25
xml2 * 0.1.1 2015-06-02
xtable 1.7-4 2014-09-12
yaml 2.1.13 2014-06-12
source
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.1)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
Github (hadley/dplyr@75e8303)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
Github (aaboyles/hadleyverse@16532fe)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
Github (hadley/lazyeval@ecb8dc0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
Github (hoxo-m/pforeach@2c44f3b)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
Github (RcppCore/Rcpp@6ae91cc)
Github (hadley/readr@f4a3956)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
Github (hadley/stringr@380c88f)
CRAN (R 3.2.0)
Github (hadley/tidyr@0dc87b2)
CRAN (R 3.2.0)
CRAN (R 3.2.0)
CRAN (R 3.2.0)