Previous rPubs pages
- Chapter 1: Warm-up
- Chapter 2: UNIX command basics
- Chapter 3: Regular expressions
- Chapter 4: Morphological analysis
- Chapter 5: Syntactic parsing
- Chapter 6: Processing English text
- Chapter 7: Databases
- Chapter 8: Machine learning

Continuing from the previous posts, this article works through the NLP 100 Knocks (2015 edition).

(The preface below, "About the NLP 100 Knocks", is the same as in the previous posts.)


Overview

Preface (about the NLP 100 Knocks)
- This article works through the NLP 100 Knocks (2015 edition), published by the Inui-Okazaki Laboratory at Tohoku University, in R.
- The pre-revision NLP 100 Knocks are also available on the laboratory's site mentioned above.


Preface (about R)
- Please note that R syntax and functions are not explained at all.
- Instead of the string processing in {base}, this article uses {stringr} (>= 1.0.0) with pipes wherever possible ({stringi} is also used where appropriate). Note that some tasks do not lend themselves well to pipe-based processing.
- In addition to the above, this chapter uses {pforeach} to parallelize input and aggregation, and {Matrix} to handle sparse matrices; see the sketch just below for the basic {pforeach} call pattern.
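For those who have not used {pforeach}: it wraps foreach/doParallel so that a parallel loop is written as a call followed by a braced body, with the results combined automatically. A minimal sketch of the basic call pattern (the toy computation is mine; the chunked, iterator-based usage appears in the tasks below):

library(pforeach)
# Square 1..3 on parallel workers; the results are combined into one vector
pforeach(i = 1:3)({
  i ** 2
})
# [1] 1 4 9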


Reference pages


Comments and corrections
- Suggestions for better approaches, handy functions I missed, mistakes, and any other feedback are very welcome.
- Reports through either of the channels below would be encouraging (note: I am not very used to Git).
 Twitter, GitHub



R code

Loading packages

# devtools::install_github("aaboyles/hadleyverse")
SET_LOAD_LIB <- c("knitr", "hadleyverse", "stringi", "lazyeval", "pforeach", "countrycode", "Matrix", "irlba", "Rcpp", "inline")
sapply(X = SET_LOAD_LIB, FUN = library, character.only = TRUE, logical.return = TRUE)
##       knitr hadleyverse     stringi    lazyeval    pforeach countrycode 
##        TRUE        TRUE        TRUE        TRUE        TRUE        TRUE 
##      Matrix       irlba        Rcpp      inline 
##        TRUE        TRUE        TRUE        TRUE
load_packages <- as.character(na.omit(object = stringr::str_match(string = search(), pattern = "^package:(.*)")[, 2]))
knitr::opts_chunk$set(comment = NA)

Setup

# Input data URL for Chapter 9 (fixed)
# TASK_INPUT_URL <- "http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r10-105752.txt.bz2"
TASK_INPUT_URL <- "http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2"


# Fetch the file
download.file(
  url = TASK_INPUT_URL, destfile = basename(TASK_INPUT_URL), 
  method = "wget", quiet = FALSE
)
if (!file.exists(basename(TASK_INPUT_URL))) {
  stop("File not found.")
}

80. Corpus formatting

The simplest way to convert a sentence into a sequence of words is to split it on whitespace. With this method, however, sentence-final periods, parentheses, and other symbols end up included in the words. So, after splitting each line of the corpus into a list of tokens on whitespace, apply the following processing to each token to remove symbols from the words:
- Delete the following characters when they appear at the beginning or end of a token: .,!?;:()[]'"
- Delete tokens that have become empty strings
After applying this processing, join the tokens with spaces and save the result to a file.
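Before the full solution, here is a minimal sketch of the token-cleaning rule on a single sentence (the sample sentence and the run-stripping regex are my own; the solution below additionally wraps this in data frames and parallel chunks):

sample_sentence <- "Hello, world! (This is a test.)"
tokens <- stringr::str_split(string = sample_sentence, pattern = "[:space:]")[[1]]
# Strip the listed symbols from the head and the tail of each token
tokens <- stringr::str_replace_all(
  string = tokens,
  pattern = "^[.,!?;:()\\[\\]'\"]+|[.,!?;:()\\[\\]'\"]+$",
  replacement = ""
)
stringr::str_c(tokens[tokens != ""], collapse = " ")
# [1] "Hello world This is a test"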

tokenizeSentence <- function (
  str, pattern_list
) {

  return(
    dplyr::data_frame(
      # Split on whitespace, then strip the listed symbols
      # from the head and the tail of each token
      text = stringr::str_replace_all(
        string = stringr::str_split(string = str, pattern = pattern_list$SPLIT_SEP) %>%
          dplyr::combine(), 
        pattern = stringr::str_c(
          c(
            stringr::str_c("^", pattern_list$DELETE_CHAR),
            stringr::str_c(pattern_list$DELETE_CHAR, "$")
          ),
          collapse = "|"
        ),
        replacement = ""
      )
    ) %>%
      # Drop tokens that became empty strings, then rejoin with spaces
      dplyr::filter(.$text != pattern_list$IS_NULL) %>%
      dplyr::combine(.) %>%
      unlist %>%
      stringr::str_c(collapse = pattern_list$JOIN_SEP) %>%
      dplyr::data_frame(text = .)
  )
}


SET_PARALLE <- list(IS_PARALLEL = TRUE, CORE = 3, ITERATORS_CHUNK_SIZE = 5000)
SET_CHAR_PATTERN <- list(
  SPLIT_SEP = "[:space:]",
  DELETE_CHAR = c("\\.", "\\,", "\\!", "\\?", "\\;", "\\:", "\\(", "\\)", "\\[", "\\]", "\\'", "\""),
  IS_NULL = "",
  JOIN_SEP = " "
)


input_con <- file(description = basename(TASK_INPUT_URL), open = "r")
iter_file <- iterators::ireadLines(
  con = input_con,
  n = SET_PARALLE$ITERATORS_CHUNK_SIZE
)
formatted_text <- pforeach::pforeach(
  read_df = iter_file,
  .c = rbind,
  .export = c("tokenizeSentence", "SET_CHAR_PATTERN"),
  .packages = load_packages,
  .parallel = SET_PARALLE$IS_PARALLEL, .cores = SET_PARALLE$CORE
)({
  data.frame(text = read_df, stringsAsFactors = FALSE) %>%
    dplyr::rowwise(.) %>%
    dplyr::do(., tokenizeSentence(str = .$text, pattern_list = SET_CHAR_PATTERN))
}) %>%
  print
Source: local data frame [252,694 x 1]
Groups: <by row>

                                                                            text
1                                                                      Anarchism
2    Anarchism is a political philosophy that advocates stateless societies ofte
3    As a subtle and anti-dogmatic philosophy anarchism draws on many currents o
4    The central tendency of anarchism as a social movement has been represented
5                                                      Etymology and terminology
6    The term anarchism is a compound word composed from the word anarchy and th
7                                                                        History
8                                                                        Origins
9    The earliest anarchist themes can be found in the 6th century BC among the 
10 The French renaissance political philosopher Étienne de La Boétie wrote in 
..                                                                           ...
close(input_con)

81. Dealing with country names made of multiple words

In English, a sequence of multiple words can act as a single unit of meaning. For example, the United States is written "United States" and the United Kingdom "United Kingdom", but the words "United", "States", or "Kingdom" on their own are ambiguous about the concept or entity they denote. We would therefore like to recognize compound words in the corpus and treat each compound as one word in order to estimate its meaning. Since accurately identifying compounds is very hard, though, here we settle for recognizing country names made of multiple words.
Obtain a list of country names from somewhere on the internet, and for the multi-word country names appearing in the corpus from 80, replace their spaces with underscores. For example, "United States" should become "United_States" and "Isle of Man" should become "Isle_of_Man".
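The key device below is that stringr::str_replace_all() accepts a named character vector, treating the names as patterns and the values as replacements. A minimal sketch with a two-entry map of my own (the solution builds the full map from {countrycode}):

replace_map <- c("United States" = "United_States", "Isle of Man" = "Isle_of_Man")
stringr::str_replace_all(
  string = "Moved from the United States to the Isle of Man",
  pattern = replace_map
)
# [1] "Moved from the United_States to the Isle_of_Man"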

# Use the data in {countrycode} as the country-name list
coutry_name <- stringr::str_replace(
  string = countrycode_data$country.name, pattern = "\\(.*\\)", replacement = ""
)
replace_coutry_name <- stringr::str_replace_all(
  string = coutry_name, pattern = " ", replacement = "_"
)
names(replace_coutry_name) <- coutry_name

# Process only the sentences that contain a country name
is_include_coutry <- sapply(
  X = lapply(X = formatted_text$text, FUN = stringr::str_detect, pattern = coutry_name),
  FUN = any
)
formatted_text$text[is_include_coutry] <- stringr::str_replace_all(
  string = formatted_text$text[is_include_coutry], pattern = replace_coutry_name
)


# Show the results whose string length is reasonably short
formatted_text$text[is_include_coutry][
  stringr::str_length(string = formatted_text$text[is_include_coutry]) < 15
]
  [1] "India"           "Germany"         "Bavaria"        
  [4] "Japanese"        "Ancient Greece"  "Canada"         
  [7] "New_Zealand"     "Pakistan"        "India"          
 [10] "United_States"   "Goa and India"   "Japan"          
 [13] "Hungary"         "Andorra"         "Armenia"        
 [16] "Austria"         "Azerbaijan"      "Belarus"        
 [19] "Belgium"         "Bulgaria"        "Canada"         
 [22] "Croatia"         "Czech_Republic"  "Denmark"        
 [25] "Finland"         "France"          "Georgia"        
 [28] "Germany"         "Greece"          "Hungary"        
 [31] "Iceland"         "Ireland"         "Italy"          
 [34] "Kazakhstan"      "Kyrgyzstan"      "Spain"          
 [37] "Sweden"          "United_Kingdom"  "Canada"         
 [40] "United_States"   "Brazil"          "Chile"          
 [43] "France"          "La Malinche"     "New_Caledonia"  
 [46] "Canada"          "United_Kingdom"  "United_States"  
 [49] "United_Kingdom"  "France"          "United_States"  
 [52] "New_Zealand"     "Germany"         "Japanese"       
 [55] "Japanese"        "Seychelles"      "Antarctica"     
 [58] "Canada"          "Costa_Rica"      "France"         
 [61] "Indonesia"       "Japan"           "Malaysia"       
 [64] "Mexico"          "New_Zealand"     "Singapore"      
 [67] "United_Kingdom"  "United_States"   "Germany"        
 [70] "Sweden"          "United_States"   "Logic in India" 
 [73] "Logic in China"  "United_States"   "Nigeria"        
 [76] "Senegal"         "Somalia"         "Uganda"         
 [79] "China"           "Hong_Kong"       "Japan"          
 [82] "Korea"           "India"           "Philippines"    
 [85] "Morocco"         "Israel"          "Croatia"        
 [88] "Denmark"         "Finland"         "France"         
 [91] "Greece"          "Hungary"         "Italy"          
 [94] "Latvia"          "Malta"           "Netherlands"    
 [97] "Norway"          "Poland"          "Serbia"         
[100] "Spain"           "Sweden"          "Switzerland"    
[103] "Turkey"          "Cuba"            "Jamaica"        
[106] "Mexico"          "Panama"          "Costa_Rica"     
[109] "Nicaragua"       "Guatemala"       "Honduras"       
[112] "Australia"       "Fiji"            "New_Zealand"    
[115] "India"           "Boston Indiana"  "Peru Maine"     
[118] "Germany"         "Japan"           "Peru Vermont"   
[121] "Move to Italy"   "Tour de France"  "Reunion 1995"   
[124] "Tour de France"  "King of Jordan"  "Namur Belgium"  
[127] "Kosovo War"      "Iraq"            "Israel"         
[130] "China"           "Australia"       "United_Kingdom" 
[133] "United_States"   "Palau Islands"   "China"          
[136] "Japan"           "Korea"           "Mongolia"       
[139] "Singapore"       "United_States"   "Spain"          
[142] "Greece"          "India"           "China"          
[145] "Mexico"          "New France"      "United_States"  
[148] "Brazil"          "India"           "Bangladesh"     
[151] "Pakistan"        "Nepal"           "Sri_Lanka"      
[154] "Afghanistan"     "Turkey"          "Azerbaijan"     
[157] "Tajikistan"      "Egypt"           "Morocco"        
[160] "Serbia"          "Reunion"         "Australia"      
[163] "New_Zealand"     "South_Africa"    "China"          
[166] "Bochum Germany"  "Japanese tour"   "South_Africa"   
[169] "Hong_Kong"       "Singapore"       "United_States"  
[172] "Canada"          "United_Kingdom"  "Germany"        
[175] "India"           "Armenia"         "Latvia"         
[178] "Lithuania"       "Slovakia"        "Spain"          
[181] "Mexico"          "Argentina"       "Colombia"       
[184] "Peru"            "Germany"         "United_Kingdom" 
[187] "United_States"   "France"          "Italy"          
[190] "China"           "Japan"           "War with Spain" 
[193] "United_States"   "United_States"   "Canada"         
[196] "Chile"           "Honduras"        "Mexico"         
[199] "Brazil"          "Austria"         "France"         
[202] "Ireland"         "Italy"           "Netherlands"    
[205] "United_Kingdom"  "Japan"           "Israel"         
[208] "Australia"       "New_Zealand"     "United_States"  
[211] "Cyprus"          "Egypt"           "Hungary"        
[214] "Morocco"         "India"           "United_States"  
[217] "Japanese"        "Avro Canada"     "Mali"           
[220] "Kenya"           "Gabon"           "Botswana"       
[223] "Burkina_Faso"    "Egypt"           "Kosovo"         
[226] "Montenegro"      "Greece"          "Canada"         
[229] "France"          "Germany"         "Italy"          
[232] "Malaysia"        "New_Zealand"     "Philippines"    
[235] "Poland"          "Portugal"        "Spain"          
[238] "Sri_Lanka"       "Thailand"        "Tunisia"        
[241] "United_States"   "Canada"          "Cuba"           
[244] "Haiti"           "Canada"          "Mexico"         
[247] "United_States"   "Canada"          "Australia"      
[250] "United_States"   "United_Kingdom"  "United_States"  
[253] "United_States"   "Jersey Numbers"  "United_Kingdom" 
[256] "India"           "Ireland"         "Puerto_Rico"    
[259] "Northern India"  "Southern India"  "Western India"  
[262] "Eastern India"   "Spain"           "China"          
[265] "Iraq War"        "United_Kingdom"  "United_States"  
[268] "HMS Fiji 58"     "Korean War"      "Lithuanians"    
[271] "1920 in France"  "Malta"           "Korean War"     
[274] "Australia"       "Canada"          "France"         
[277] "Germany"         "Japan"           "Netherlands"    
[280] "New_Zealand"     "Rwanda"          "Switzerland"    
[283] "United_Kingdom"  "United_States"   "Ireland"        
[286] "United_States"   "Australia"       "Denmark"        
[289] "South China"     "South China"     "Ottoman Greece" 
[292] "New_Zealand"     "South_Africa"    "France 3"       
[295] "Korean War"      "Indonesia"       "Egypt"          
[298] "Reunion"         "United_States"   "Canada"         
[301] "Germany"         "Monaco"          "Zimbabwe"       
[304] "United_States"   "Canada"          "Thailand"       
[307] "United_Kingdom"  "United_States"   "United_Kingdom" 
[310] "United_States"   "Canada"          "Mexico"         
[313] "Sweden"          "Ireland"         "Italy"          
[316] "Malta"           "Mexico"          "Australia"      
[319] "Australia"       "Belgium"         "Canada"         
[322] "Chile"           "China"           "Ecuador"        
[325] "France"          "Japan"           "Netherlands"    
[328] "New_Zealand"     "Poland"          "South_Africa"   
[331] "Turkey"          "United_Kingdom"  "United_States"  
[334] "Zimbabwe"        "Kosovo–Resava"   "United_States"  
[337] "Canada"          "In India"        "Korean War"     
[340] "South_Africa"    "Argentina"       "Australia"      
[343] "Belgium"         "Brazil"          "Italy"          
[346] "Israel"          "New_Zealand"     "Pakistan"       
[349] "Portugal"        "Philippines"     "Qatar"          
[352] "United_Kingdom"  "United_States"   "Thailand"       
[355] "Turkey"          "United_States"   "United_Kingdom" 
[358] "Japan"           "Australia"       "India"          
[361] "Israel"          "Malaysia"        "North Korea"    
[364] "United_Kingdom"  "United_States"   "Puerto_Rico"    
[367] "United_Kingdom"  "Netherlands"     "France"         
[370] "Reunion\""       "United_States"   "United_Kingdom" 
[373] "Japan only"      "Japan only"      "Japan"          
[376] "Little Italy"    "Australia"       "Austria"        
[379] "Hungary"         "North India"     "United_States"  
[382] "United_Kingdom"  "Belgium"         "New_Zealand"    
[385] "Ralph Jordan"    "Armenian"        "Indiana"        
[388] "Duke of Parma"   "Niger in 1974"   "United_States"  
[391] "Australia"       "Doug Ireland"    "Indonesia"      
[394] "Philippines"     "Kabri Israel"    "Kylie Ireland"  
[397] "Francesinha"     "Ireland"         "Romania"        
[400] "Korean War"      "United_States"   "Poland"         
[403] "Germany"         "Turkey"          "India"          
[406] "Japan"           "Australia"       "Argentina"      
[409] "South_Africa"    "Canada"          "Chile"          
[412] "United_States"   "Canada"          "United_Kingdom" 
[415] "Australia"       "In Georgian"     "In Belarusian"  
[418] "In Croatian"     "In Romanian"     "In Estonian"    
[421] "In Bulgarian"    "Korean"          "Japanese"       
[424] "United_States"   "Beer in Israel"  "Belgium"        
[427] "Australia"       "Canada"          "United_States"  
[430] "United_Kingdom"  "Finland"         "Canada"         
[433] "Greece"          "Iceland"         "Japan"          
[436] "New_Zealand"     "United_Kingdom"  "United_States"  
[439] "France"          "Germany"         "Japan"          
[442] "United_Kingdom"  "United_States"   "Croatia"        
[445] "Cyprus"          "Czech_Republic"  "France"         
[448] "Germany"         "Greece"          "Hungary"        
[451] "Ireland"         "Italy"           "Malta"          
[454] "Poland"          "Portugal"        "Serbia"         
[457] "Spain"           "Turkey"          "Ukraine"        
[460] "United_Kingdom"  "Canada"          "United_States"  
[463] "Mexico"          "South_Africa"    "Tunisia"        
[466] "Fiji"            "Australia"       "New_Zealand"    
[469] "Antarctica"      "Zimbabwe"        "Ireland"        
[472] "Australian CD"   "Air Jordans"     "Israel"         
[475] "ED250 Poland)"   "Guatemala Clan"  "Australia"      
[478] "Canada"          "Ireland"         "Japan"          
[481] "Malaysia"        "New_Zealand"     "Norway"         
[484] "Slovakia"        "South_Africa"    "Spain"          
[487] "United_Kingdom"  "United_States"   "United_States"  
[490] "United_States"   "In Albania"      "Funke v France" 
[493] "Georgia"         "Indiana"         "New Jersey"     
[496] "New Mexico"      "India"           "Japan"          
[499] "Lebanon"         "Philippines"     "Austria"        
[502] "Czech_Republic"  "Denmark"         "Finland"        
[505] "France"          "Germany"         "Iceland"        
[508] "Ireland"         "Italy"           "Netherlands"    
[511] "Norway"          "Poland"          "Portugal"       
[514] "Romania"         "Slovakia"        "Slovenia"       
[517] "Spain"           "Sweden"          "United_Kingdom" 
[520] "Canada"          "Honduras"        "Mexico"         
[523] "United_States"   "Australia"       "New_Zealand"    
[526] "Brazil"          "Peru"            "Media of Sudan" 
[529] "China"           "Japan"           "Korea"          
[532] "Singapore"       "United_States"   "Australia"      
[535] "Armenian"        "Egypt"           "Perugia"        
[538] "Perugia"         "Bank of Canada"  "O Canada"       
[541] "Guam"            "Japan"           "Korean War"     
[544] "In Zimbabwe"     "Zambia"          "Spain"          
[547] "Greenland"       "Indian College"  "Sweden"         
[550] "In Israel"       "By Israelis"     "Indian Agent"   
[553] "Paraguayan War"  "France"          "Mongolian name" 
[556] "Mongolia"        "Inner Mongolia"  "Ukraine"        
[559] "Estonia"         "Czech_Republic"  "In Korea"       
[562] "Korean War"      "Afghanistan"     "Iraq"           
[565] "Oia Greece"      "Ilija Slovakia"  "Cuban art"      
[568] "Iceland"         "Greenland"       "Norway"         
[571] "Sweden"          "Denmark"         "Korean War"     
[574] "Morocco"         "East Pakistan"   "Emma of Italy"  
[577] "Japanese ska"    "Indian squad"    "Italy"          
[580] "New Guinea"      "Korean War"      "Arts of China"  
[583] "Argentina"       "Brazil"          "Chile"          
[586] "Colombia"        "Peru"            "SOS Israel"     
[589] "Queen of Spain"  "India 2007)"     "India 2008)"    
[592] "India 2014)"     "United_Kingdom"  "United_States"  
[595] "Haiti"           "United_States"   "North Koreans"  
[598] "South Koreans"   "Sulm Austria"    "South China"    
[601] "• Albanian"      "• Latvian"       "• Romanian"     
[604] "• Serbian"       "Italy"           "Montenegro"     
[607] "Parma Panthers"  "Malik Raja"      "Australia"      
[610] "Italy"           "United_Kingdom"  "Poland"         
[613] "Uganda"          "United_States"   "Greece"         
[616] "Yugoslavia"      "Poland"          "Czechoslovakia" 
[619] "Bulgaria"        "Romania"         "Hungary"        
[622] "Peru also is"    "vs Fiji"         "Bavarian Cup"   
[625] "Time in Ghana"   "France"          "United_States"  
[628] "Australia"       "Australia"       "Belize"         
[631] "Canada"          "France"          "New_Zealand"    
[634] "Philippines"     "United_Kingdom"  "United_States"  
[637] "New Jersey"      "Mexico U23"      "United_States"  
[640] "Mexico"          "New_Zealand"     "Colombia"       
[643] "Viet_Nam War"    "Tonga"           "In Canada"      
[646] "Aki Polandwood"  "Australia"       "Fiji"           
[649] "Kiribati"        "Nauru"           "New_Zealand"    
[652] "Palau"           "Samoa"           "Tonga"          
[655] "Tuvalu"          "Vanuatu"         "and Japanese"   
[658] "United_States"   "Spain"           "Mexico"         
[661] "Canada"          "Mexico"          "China"          
[664] "United_Kingdom"  "Indonesia"       "Thailand"       
[667] "Japan"           "Australia"       "Australia"      
[670] "United_Kingdom"  "Canada"          "Japan"          
[673] "United_States"   "Libya"           "Ireland"        
[676] "France"          "Albania"         "Czechoslovakia" 
[679] "Hong_Kong"       "Hungary"         "Indonesia"      
[682] "Japan"           "Poland"          "Romania"        
[685] "Serbia"          "South Korea"     "United_States"  
[688] "United_States"   "After Bermuda"   "Georgia"        
[691] "New Jersey"      "United_Kingdom"  "United_States"  
[694] "Switzerland"     "Wild Turkey"     "United_States"  
[697] "American_Samoa"  "Lower Austria"   "Upper Austria"  
[700] "United_States"   "Mozambique"      "Thailand"       
[703] "Gulf of Mexico"  "Australia"       "Canada"         
[706] "Greece"          "India"           "Israel"         
[709] "Mexico"          "New_Zealand"     "South_Africa"   
[712] "United_Kingdom"  "United_States"   "South_Africa"   
[715] "East Indians"    "Portugal"        "Albania"        
[718] "Austria"         "Belgium"         "Bulgaria"       
[721] "Croatia"         "Cyprus"          "Czech_Republic" 
[724] "Denmark"         "Estonia"         "Finland"        
[727] "France"          "Germany"         "Greece"         
[730] "Hungary"         "Iceland"         "Ireland"        
[733] "Italy"           "Kosovo"          "Latvia"         
[736] "Lithuania"       "Luxembourg"      "Malta"          
[739] "Monaco"          "Montenegro"      "Netherlands"    
[742] "Norway"          "Poland"          "Portugal"       
[745] "Romania"         "Serbia"          "Slovenia"       
[748] "Spain"           "Sweden"          "Switzerland"    
[751] "Gibraltar"       "Georgia State"   "Liberia"        
[754] "Reach Canada"    "New Guinea"      "Philippines"    
[757] "Hong_Kong"       "Philippines"     "Trip to Japan"  
[760] "France"          "Germany"         "Australia"      
[763] "Canada"          "Israel"          "New_Zealand"    
[766] "United_States"   "Haiti 1802"      "Brazil"         
[769] "Reunion Gummer"  "Korea"           "Hong_Kong"      
[772] "Australia"       "Bahrain"         "Canada"         
[775] "Cyprus"          "Jersey"          "Mauritius"      
[778] "South_Africa"    "United_States"   "Finland"        
[781] "Ireland"         "Israel"          "Netherlands"    
[784] "Tour de France"  "Puerto_Rico"     "Japan"          
[787] "Brazil Gloob"    "Portugal RTP2"   "Origins Canada" 
[790] "India"           "Iraq"            "Japan"          
[793] "Libya"           "United_States"   "Israel"         
[796] "North Korea"     "Canada"          "Australia"      
[799] "Caen France"     "GAC Ireland"     "Jordan Obita"   
[802] "Brad Jordan"     "Taylah Jordan"   "Israeli Druze"  
[805] "In Israel"       "Jordan Kovacs"   "New Jersey"     
[808] "Korean War"      "Norway"          "Iraq"           
[811] "Greece"          "Denmark"         "Spain"          
[814] "Yugoslavia"      "Romania"         "Sweden"         
[817] "Belgium"         "Luxembourg"      "Parma"          
[820] "Japan"           "Brazil"          "Canada"         
[823] "Australia"       "Ireland"         "Hockey Canada"  
[826] "Italy"           "France"          "Greece"         
[829] "Georgia Force"   "New Jersey"      "South_Africa"   
[832] "Korean War"      "Japan"           "United_States"  
[835] "Pakistan"        "India"           "Libyan people"  
[838] "Italy"           "Japan"           "India"          
[841] "In Australia"    "In New_Zealand"  "Canada"         
[844] "Croatia"         "France"          "India"          
[847] "Italy"           "Morocco"         "Norway"         
[850] "Peru"            "Portugal"        "Spain"          
[853] "Turkey"          "United_States"   "Frances Hugle"  
[856] "Korean War"      "Canada"          "China"          
[859] "Germany"         "India"           "Israel"         
[862] "Japan"           "Norway"          "b Ghana’s Gold"
[865] "Israel"          "France"          "United_Kingdom" 
[868] "Liechtenstein"   "Spain"           "United_States"  
[871] "Israel"          "Peru"            "Burundi"        
[874] "Haiti"           "Mexico"          "Thailand"       
[877] "Korean War"      "Malin Diaz"      "Germany - Puma" 
[880] "Turkey - Altay"  "Korean War"      "Indian IS 1293" 
[883] "Tunisia"         "Oman"            "Italy"          
[886] "Japanese"        "In Estonia"      "In Latvia"      
[889] "Visit to Italy"  "East Germany"    "Cuban Thaw"     

82. Extracting contexts

For every word t appearing in the corpus created in 81, write out all pairs of word t and context word c in tab-separated format, where context words are defined as follows:
- Extract the d words before and after word t as the context words c (the context words do not include t itself).
- Each time a word t is chosen, decide the context width d at random from the range {1, 2, 3, 4, 5}.
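As a sketch of the extraction rule itself, a plain loop over a toy token vector (illustration only; the solution below vectorizes the window extraction with stringr::word() and parallelizes over chunks of sentences):

tokens <- c("a", "b", "c", "d", "e")
set.seed(71)
for (i in seq_along(tokens)) {
  d <- sample.int(n = 5, size = 1)  # random context width for this word
  context <- tokens[setdiff(max(1, i - d):min(length(tokens), i + d), i)]
  cat(stringr::str_c(tokens[i], context, sep = "\t"), sep = "\n")
}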

# word, type (left or right), window, context_word
fetchContextWord <- function (
  str, window_size, seed = NULL
){

  str_len <- length(str)
  if (!is.null(seed)){
    set.seed(seed = seed)
  }
  context_window <- sample.int(n = window_size, size = str_len, replace = TRUE)
  word_window <- seq(from = 1, to = str_len) + context_window
  word_window[word_window > str_len] <- str_len
  
  # The d words to the right of each word
  right_word <- stringr::str_split_fixed(
    string = stringr::word(
      string = stringi::stri_flatten(str = str, collapse = " "),
      start = seq(from = 1, to = str_len), end = word_window,
      sep = " "
    ),
    pattern = " ",
    n = context_window + 1
  )
  if (ncol(right_word) < window_size + 1) {
    right_word <- cbind(
      right_word,
      matrix(data = "", nrow = nrow(right_word), ncol = (window_size + 1) - ncol(right_word))
    )
  }
  
  # The d words to the left of each word
  left_word <- stringr::str_split_fixed(
    string = stringr::word(
      string = stringi::stri_flatten(str = str[seq(from = str_len, to = 1)], collapse = " "),
      start = seq(from = 1, to = str_len), end = word_window,
      sep = " "
    ),
    pattern = " ",
    n = context_window + 1
  )[seq(from = str_len, to = 1), , drop = FALSE]
  if (ncol(left_word) < window_size + 1) {
    left_word <- cbind(
      left_word,
      matrix(data = "", nrow = nrow(left_word), ncol = (window_size + 1) - ncol(left_word))
    )
  }


  right_window <- left_window <- seq(from = 1, to = window_size + 1)
  names(x = right_window) <- names(x = left_window) <- c("word", seq(from = 1, to = window_size))

  return(
    dplyr::bind_rows(
      data.frame(right_word, stringsAsFactors = FALSE) %>%
        dplyr::rename_(
          .dots = setNames(
            object = stringr::str_c("X", right_window), 
            nm = names(right_window)
          )
        ) %>%
        dplyr::mutate(type = "right"),
      data.frame(left_word, stringsAsFactors = FALSE) %>%
        dplyr::rename_(
          .dots = setNames(
            object = stringr::str_c("X", left_window), 
            nm = names(left_window)
          )
        ) %>% 
        dplyr::mutate(type = "left")
    ) %>%
      tidyr::gather_(
        key_col = c("window"), 
        gather_cols = as.character(seq(from = 1, to = window_size)),
        value_col = c("context_word")
    )
  )
}

SET_SEED <- 71
SET_WINDOW_SIZE <- 5
SET_PARALLE_SPLIT_SIZE <- 5000
SET_WRITE_FILE_NAME <- "term_context.tsv"


# Sentence order does not matter for this task, so use random numbers to split the sentences into chunks for processing
formatted_text$split <- sort(
  x = sample.int(
    n = SET_PARALLE_SPLIT_SIZE, size = nrow(formatted_text),
    replace = TRUE
  )
)
formatted_text_iter <- iterators::isplit(x = formatted_text$text, f = formatted_text$split)
word_context_word <- pforeach::pforeach(
  p_read_df = formatted_text_iter,
  .c = rbind,
  .export = c("fetchContextWord", "SET_WINDOW_SIZE", "SET_SEED", "SET_WRITE_FILE_NAME"),
  .packages = load_packages,
  .parallel = SET_PARALLE$IS_PARALLEL, .cores = SET_PARALLE$CORE,
  .inorder = FALSE
)({
  dplyr::data_frame(text = p_read_df$value) %>%
    dplyr::rowwise(.) %>%
    dplyr::do(.,
      fetchContextWord(
        str = stringr::str_split(string = .$text, pattern = "[:blank:]") %>% 
          unlist,
        window_size = SET_WINDOW_SIZE,
        seed = SET_SEED
      )
    ) %>%
    dplyr::filter(word != "") %>%
    dplyr::filter(context_word != "") %>%
    dplyr::select(word, context_word) %>%
    readr::write_tsv(path = SET_WRITE_FILE_NAME, append = TRUE)
})

word_context_word <- readr::read_tsv(
  file = SET_WRITE_FILE_NAME, n_max = -1,
  col_names = c("word", "context_word")
) %>% 
  print

|================================================================================| 100%  782 MB
Warning: 719 problems parsing 'term_context.tsv'. See problems(...) for
more details.
Source: local data frame [68,078,486 x 2]

         word context_word
1   Anarchism           is
2          is            a
3           a    political
4   political   philosophy
5  philosophy         that
6        that    advocates
7   advocates    stateless
8   stateless    societies
9   societies        often
10      often      defined
..        ...          ...
readr::problems(x = word_context_word)
Source: local data frame [719 x 4]

       row col       expected actual
1  1169592   3 Only 2 columns       
2  1171327   3 Only 2 columns       
3  1175827   3 Only 2 columns       
4  1178457   3 Only 2 columns       
5  1180217   3 Only 2 columns       
6  1182196   3 Only 2 columns       
7  1183243   3 Only 2 columns       
8  1185986   3 Only 2 columns       
9  1186978   3 Only 2 columns       
10 1188287   3 Only 2 columns       
..     ... ...            ...    ...
word_context_word <- na.omit(word_context_word)

83. Measuring word/context frequencies

Using the output of 82, compute the following occurrence distributions and constant:
- f(t,c): the co-occurrence count of word t and context word c
- f(t,∗): the occurrence count of word t
- f(∗,c): the occurrence count of context word c
- N: the total occurrence count of word–context pairs
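All four quantities are plain grouped counts over the pair list written out in 82. A toy sketch with dplyr (the three-row pair table is made up):

pairs <- dplyr::data_frame(
  word = c("a", "a", "b"),
  context_word = c("x", "y", "x")
)
dplyr::count(pairs, word, context_word)  # f(t,c)
dplyr::count(pairs, word)                # f(t,*)
dplyr::count(pairs, context_word)        # f(*,c)
nrow(pairs)                              # N: total number of pairs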

# f(t,c)
co_word_pair <- word_context_word %>%
  dplyr::group_by(word, context_word) %>%
  dplyr::summarize(tc = n()) %>%
  print
Source: local data frame [21,748,822 x 3]
Groups: word

                                                                                          word
1  \this\nform\tmaterial\nto\tform\nfeed\tto\nfor\tfeed\nthe\tfor\nlast\tthe\ntime\tlast\nfrom
2                                                                                            !
3                                                                                            !
4                                                                                            !
5                                                                                            !
6                                                                                            !
7                                                                                            !
8                                                                                            !
9                                                                                            !
10                                                                                           !
..                                                                                         ...
Variables not shown: context_word (chr), tc (int)
# f(t,∗)
word_freq <- word_context_word %>%
  dplyr::ungroup() %>%
  dplyr::group_by(word) %>%
  dplyr::summarize(t = n()) %>%
  print
Source: local data frame [432,329 x 2]

                                                                                          word
1  \this\nform\tmaterial\nto\tform\nfeed\tto\nfor\tfeed\nthe\tfor\nlast\tthe\ntime\tlast\nfrom
2                                                                                            !
3                                                                                          !*@
4                                                                                           !,
5                                                                                            "
6                                                                                           ""
7                                                                                          "")
8                                                                                      ""Abid"
9                                                              ""Allerheiligen-Wasserfälle"")
10                                                                                    ""Banque
..                                                                                         ...
Variables not shown: t (int)
# f(∗,c)
context_freq <- word_context_word %>%
  dplyr::ungroup() %>%
  dplyr::group_by(context_word) %>%
  dplyr::summarize(c = n()) %>%
  print
Source: local data frame [432,863 x 2]

                      context_word    c
1                                !   25
2                              !*@    5
3                               !,    6
4                                " 1890
5                               ""   57
6                              "")   34
7                          ""Abid"    7
8  ""Allerheiligen-Wasserfälle"")    4
9                         ""Banque    8
10                      ""Dampf"")    7
..                             ...  ...
# N (note: as written, this counts the distinct word–context pairs)
nrow(co_word_pair)
[1] 21748822

84. Building the word–context matrix

Using the output of 83, build a word–context matrix X, where each element \(X_{tc}\) is defined as follows:
- If \(f(t,c) \geq 10\): \(X_{tc} = \mathrm{PPMI}(t,c) = \max \left\{ \log \frac{N \times f(t,c)}{f(t,*) \times f(*,c)}, 0 \right\}\)
- If \(f(t,c) < 10\): \(X_{tc} = 0\)
Here, PPMI(t,c) is a statistic called Positive Pointwise Mutual Information. Note that the number of rows and columns of X is on the order of millions, so holding every element of the matrix in main memory is impossible. Fortunately, most elements of X are 0, so it suffices to write out only the non-zero elements.
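A quick worked instance of the formula, with made-up counts f(t,c) = 12, f(t,∗) = 100, f(∗,c) = 300, N = 10000:

ftc <- 12; ft <- 100; fc <- 300; N <- 10000
max(log((N * ftc) / (ft * fc)), 0)
# log(120000 / 30000) = log(4) ≈ 1.386

If the log were negative (the pair co-occurring less often than chance would predict), the max with 0 would clamp it to zero.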

SET_THRESHOLD <- 10


ppmi <- suppressWarnings(
  dplyr::left_join(
    x = dplyr::left_join(
      x = co_word_pair %>%
        dplyr::filter(tc >= SET_THRESHOLD),
      y = word_freq,
      by = "word"
    ),
    y = context_freq,
    by = "context_word"
  ) %>%
  dplyr::mutate(ppmi = log((nrow(co_word_pair) * tc) / (t * c)))
)

# Zero out the negative values (the max(..., 0) part of PPMI) and any values that became NaN/NA through overflow
ppmi$ppmi <- ifelse(ppmi$ppmi > 0, ppmi$ppmi, 0)
ppmi <- ppmi %>%
  dplyr::select(word, context_word, ppmi) %>%
  replace(is.na(.), 0)


# Convert to a sparse matrix (naively using tidyr::spread consumes far too much memory)
occur_words <- unique(word_freq$word)
ppmi$word <- factor(ppmi$word, levels = as.character(occur_words))
ppmi$context_word <- factor(ppmi$context_word, levels = as.character(occur_words))
word_context_ppmi <- Matrix::sparseMatrix(
  i = as.integer(ppmi$word), j = as.integer(ppmi$context_word),
  x = ppmi$ppmi,
  dims = c(length(occur_words), length(occur_words)),
  dimnames = list(occur_words, occur_words)
)

85. Dimensionality reduction by principal component analysis

Apply principal component analysis to the word–context matrix obtained in 84, and compress the word meaning vectors to 300 dimensions.
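With a matrix of this shape, the PCA is carried out via a truncated SVD: {irlba} computes only the leading singular vectors instead of the full decomposition. A small sanity check of the approximation on a dense toy matrix (toy data only; the solution below applies the same call to the sparse PPMI matrix):

set.seed(71)
toy <- matrix(data = rnorm(200), nrow = 20)
svd(toy)$d[1:2]                          # exact leading singular values
irlba::irlba(A = toy, nu = 2, nv = 2)$d  # truncated approximation, should agree closely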

SET_DIM_NUM <- 300


# Use an SVD (singular value decomposition) routine that supports sparse matrices
# On top of that, this uses a method that approximates the left and right singular vectors
# (my understanding here is still thin, so I will restudy with the materials and books touched on in "Closing thoughts")
lsi_res <- irlba::irlba(
  A = word_context_ppmi,
  nu = SET_DIM_NUM, nv = SET_DIM_NUM
)
word_sence <- t(
  t(lsi_res$u[, seq(from = 1, to = SET_DIM_NUM), drop = FALSE]) %*% word_context_ppmi
)
rownames(word_sence) <- rownames(word_context_ppmi)

86. Displaying a word vector

Read the word meaning vectors obtained in 85 and display the vector for "United States". Note that "United States" is internally represented as "United_States".

SET_SEARCH_WORD <- "United_States"


# Match by row name against the precomputed meaning vectors
word_sence[is.element(rownames(word_sence), SET_SEARCH_WORD), ]
  [1]  2.037838e+00 -9.375344e-01 -3.106189e-01 -1.605345e-01 -5.904840e-02
  [6]  6.518050e-01  2.602280e-02  5.277793e-02 -4.068961e-01  2.610499e-01
 [11]  3.918249e-01 -4.765525e-02  6.441821e-02  3.308307e-01 -7.000656e-01
 [16] -1.136182e+00 -7.595355e-01  1.312867e+00  2.849231e-01 -5.978687e-01
 [21] -2.377491e-01  2.204588e-01 -4.045608e-01  7.395452e-01  1.753993e-01
 [26] -5.848825e-01  7.487949e-02 -8.271620e-02  3.057006e-01 -2.858766e-01
 [31] -1.626025e-14 -7.482628e-02  5.950323e-01  4.345851e-01  1.774483e-01
 [36]  3.518586e-01 -1.127549e-01  1.978918e-02 -2.124593e-02  2.632636e-01
 [41]  1.988843e-03 -1.085607e-01 -5.579423e-02 -6.353527e-01 -8.574127e-02
 [46] -2.702903e-01 -1.086915e-01 -9.561545e-02  3.506378e-01 -1.136618e-01
 [51] -4.273408e-02  2.581737e-01  4.037139e-01 -5.753142e-02 -4.511197e-02
 [56]  3.837357e-01 -3.667011e-01 -5.704303e-01  4.766412e-01 -5.312001e-02
 [61]  5.940559e-01 -9.484519e-01  8.184152e-01 -2.463707e-01  4.159756e-01
 [66]  8.324350e-01  1.780600e-01 -2.443294e-01 -2.646356e-02  1.453572e-01
 [71]  2.839435e-01  2.313725e-01  6.264211e-01 -3.446654e-01  6.386369e-01
 [76]  3.006067e-02 -3.472549e-01  5.245093e-01  6.549072e-01  2.267178e-01
 [81] -1.744061e-01  1.544763e-01 -1.885759e-01  3.261650e-01  1.818196e-01
 [86] -5.395743e-01  5.581458e-01 -5.187504e-01  1.807359e-01 -3.550431e-01
 [91] -1.276461e-02 -8.716716e-02 -1.975487e-01  1.522283e-01  3.874773e-01
 [96]  1.084189e-02  1.789052e-01 -3.046082e-02  4.413375e-02  8.527400e-02
[101]  1.860033e-01 -1.483027e-02  8.239071e-01  2.443411e-01 -1.125405e-01
[106] -7.860227e-02  2.509777e-01  4.514155e-03 -4.874264e-01  4.044995e-02
[111]  2.337708e-01 -4.252291e-01 -8.265042e-01 -6.330464e-01 -1.257250e-01
[116] -1.402304e-01  5.298778e-01  8.078769e-02  1.996961e-01 -1.518596e-01
[121]  4.562999e-01 -5.120750e-01  1.661834e-01 -3.112895e-02  1.371840e-01
[126]  2.215744e-01 -4.915658e-01 -5.562550e-02  4.136874e-01  2.518635e-01
[131] -1.942698e-01  4.185125e-01  2.838158e-01 -1.023195e+00  2.331462e-01
[136]  3.405410e-01  2.619637e-01  1.839301e-01  4.260956e-03 -7.053745e-02
[141] -3.222604e-02  1.703391e-01 -1.908760e-01 -7.297733e-01  3.924142e-01
[146]  2.141091e-01 -8.826797e-02 -1.787676e-12 -3.463508e-01 -2.527450e-01
[151]  1.264967e-01  2.092116e-01 -1.912777e-01  2.862594e-01 -1.797915e-01
[156]  3.921730e-03 -1.882014e-01  3.413298e-01  7.014191e-02 -4.041935e-02
[161]  3.021720e-01 -5.929334e-02 -1.431572e-01 -4.116285e-01  2.270619e-01
[166] -1.322969e-01 -4.940011e-01 -3.803055e-01 -1.353691e-01 -5.702957e-01
[171] -1.425753e-01  2.804969e-01  9.874140e-02 -5.764361e-01 -1.951386e-01
[176]  3.810475e-01 -1.868904e-01  3.822573e-01 -6.042654e-02 -2.721631e-01
[181]  3.634367e-01  2.552693e-01 -2.499335e-02  1.852346e-01  6.149274e-01
[186] -2.711030e-01 -1.853175e-01 -1.172159e-01 -3.391866e-01 -3.626279e-02
[191]  5.045489e-01 -2.980203e-01  2.137921e-01 -4.796995e-01  2.405535e-01
[196] -9.349896e-01  5.770628e-02 -4.376850e-01 -3.033761e-01 -3.721689e-01
[201] -4.449024e-01 -4.249434e-01 -2.055324e-02  2.367674e-01  5.078755e-01
[206]  2.513147e-01  1.643741e-01 -1.901829e-01  5.284053e-03 -2.717752e-01
[211] -1.348662e-01  1.102604e-01 -1.819928e-01  3.032791e-01  8.319814e-02
[216] -1.158638e-01 -2.505669e-01 -2.838608e-03 -1.407063e-01  4.587667e-01
[221]  4.361180e-01 -3.722908e-01 -6.295600e-02 -4.343454e-02 -6.277148e-02
[226] -1.120107e-01 -1.999548e-01 -2.116093e-01 -7.791790e-01 -5.385965e-02
[231] -2.015585e-01 -2.973695e-01 -9.012819e-02  7.317029e-02  5.602437e-01
[236] -3.891752e-02 -7.414676e-01  3.921536e-02  2.230800e-01 -5.484260e-01
[241]  9.380090e-02  3.474563e-01  4.400047e-03 -8.169646e-01 -6.850005e-02
[246] -1.643937e-01 -2.460247e-12 -4.393261e-02 -5.913047e-01 -4.294057e-02
[251] -2.508756e-02  5.760611e-02 -2.234026e-01 -3.073265e-02 -3.955376e-02
[256] -1.963136e-01 -3.460861e-01  8.795343e-02  4.492767e-01  3.553542e-01
[261] -7.870723e-01 -4.611821e-01  1.242627e-01  1.111002e-02  4.436061e-01
[266]  5.281246e-01  2.755074e-01 -4.890112e-01  1.931703e-01 -2.591270e-02
[271] -2.714101e-01 -3.758363e-01  5.954379e-01  1.409803e-01  6.588941e-01
[276]  1.283339e-01  1.405396e-01 -2.381568e-01 -3.367582e-01 -1.644492e-01
[281]  4.049569e-02  2.608836e-01 -3.598677e-01  8.669979e-01 -1.865976e-01
[286]  3.423840e-01 -1.029759e-01  2.729978e-01  3.205047e-01 -2.795269e-01
[291] -1.308581e-01  6.806350e-02  1.276574e-01  9.465147e-02 -1.063264e-01
[296]  8.932431e-02 -2.399941e-01  3.492260e-01 -3.412677e-01 -3.299891e-01

87. Word similarity

Read the word meaning vectors obtained in 85 and compute the cosine similarity between "United States" and "U.S.". Note that "U.S." is internally represented as "U.S".
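Cosine similarity is the dot product divided by the product of the vector norms. A minimal sketch (the helper and the toy vectors are mine; the solution below computes the same quantity via crossprod()/outer() while avoiding a full N x N matrix):

cosineSim <- function (a, b) {
  sum(a * b) / (sqrt(sum(a ^ 2)) * sqrt(sum(b ^ 2)))
}
cosineSim(a = c(1, 0, 1), b = c(1, 1, 1))
# 2 / (sqrt(2) * sqrt(3)) ≈ 0.816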

# Cosine similarity written in {Rcpp}
# Not used this time
rcpp_cosine_sim <- '
  NumericVector calcCosineSim(NumericMatrix ipt_mat) {
    using namespace Rcpp;
    Environment base("package:base");
    Function Rcrossprod = base["crossprod"];
    Function Rtranspose = base["t"];
    Function Router = base["outer"];
    
    NumericMatrix numerator = Rcrossprod(Rtranspose(ipt_mat));
    NumericVector denominator = diag(numerator);
    return Rcpp::wrap(numerator / sqrt(Router(denominator, denominator)));
  }
'
calcCosine <- Rcpp::cppFunction(code = rcpp_cosine_sim, plugins = "cpp11")

# Extract only the needed rows of the cosine similarity (to avoid producing an N x N matrix)
filterCosineSim <- function (
  seed_word_vector, target_word_vectors, 
  extract_rownames = NULL
) {
  word_vectors <- rbind(seed_word_vector, target_word_vectors)
  numerator <- crossprod(x = t(x = word_vectors))
  denominator <- diag(numerator)
  return((numerator / sqrt(outer(denominator, denominator)))[extract_rownames, ])
}

SET_COMPARE_WORDS <- c("United_States", "^U\\.S$")


filterCosineSim(
  seed_word_vector = word_sence[
    is.element(rownames(word_sence), SET_COMPARE_WORDS[1]), , drop = FALSE
  ],
  target_word_vectors = word_sence[
    stringr::str_detect(
      string = rownames(word_sence), pattern = SET_COMPARE_WORDS[-1]
    ), , drop = FALSE
  ], 
  extract_rownames = SET_COMPARE_WORDS[1]
)
United_States           U.S 
    1.0000000     0.7838752 

88. Top 10 most similar words

Read the word meaning vectors obtained in 85, and output the 10 words with the highest cosine similarity to "England", together with their similarities.
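Conceptually this is one query vector against every row of the meaning-vector matrix, which can be written as a single matrix product. A minimal dense sketch (the helper is mine and assumes the whole matrix fits comfortably in memory; the solution below instead splits the rows into random chunks to bound memory use):

cosineToAllRows <- function (v, m) {
  sims <- as.vector(m %*% v) / (sqrt(rowSums(m ^ 2)) * sqrt(sum(v ^ 2)))
  setNames(sims, rownames(m))
}
# e.g. head(sort(cosineToAllRows(v = word_sence["England", ], m = word_sence), decreasing = TRUE), 11)[-1]
# (dropping the first entry, which is "England" itself with similarity 1)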

# Assign random group labels to reduce the number of vectors whose cosine similarity is computed at once
fetchCosineSimilarity <- function(
  seed_word_vector, target_words_sence, 
  seed_word_name,
  split_size
){
  
  # Convert from a sparse matrix to a dense matrix
  seed_word_vector <- t(apply(X = seed_word_vector, MARGIN = 1, FUN = as.matrix))
  target_words_sence <- data.frame(
    t(apply(X = target_words_sence, MARGIN = 1, FUN = as.matrix)), 
    stringsAsFactors = FALSE
  )
  
  target_words_sence$split <- sample.int(
    n = split_size, size = nrow(target_words_sence),
    replace = TRUE
  )
  fetch_cs <- lapply(
    X = split(
      x = target_words_sence[, !is.element(colnames(target_words_sence), "split")],
      f = target_words_sence$split
    ),
    FUN = function (target_sence) {
      cosine_sim_res <- filterCosineSim(
        seed_word_vector = seed_word_vector,
        target_word_vectors = as.matrix(target_sence), 
        extract_rownames = seed_word_name
      )[-1]
      return(cosine_sim_res[!is.nan(x = cosine_sim_res)])
    }
  )
  cs_names <- dplyr::combine(sapply(X = fetch_cs, FUN = names))
  fetch_cs <- dplyr::combine(fetch_cs)
  names(fetch_cs) <- cs_names

  return(fetch_cs)
}

TASK_EXTRACT_NUM <- 10
SET_SEED_WORD <- c("England")
SET_SPLIT_SIZE <- as.integer(nrow(word_sence) / 5000)


# Fast enough without parallelization, so none is used here
fetch_cs <- fetchCosineSimilarity(
  seed_word_vector = word_sence[is.element(rownames(word_sence), SET_SEED_WORD), , drop = FALSE],
  target_words_sence = word_sence[!is.element(rownames(word_sence), SET_SEED_WORD), , drop = FALSE],
  seed_word_name = SET_SEED_WORD,
  split_size = SET_SPLIT_SIZE
)

# Show the top 10 words
sort(fetch_cs, decreasing = TRUE)[seq(from = 1, to = TASK_EXTRACT_NUM)]
      Cheshire       Scotland       Wildside          Italy            NWA 
     0.6423472      0.5718693      0.5519467      0.4917252      0.4814824 
United_Kingdom          Japan        Germany          Wales        Ireland 
     0.4802970      0.4760204      0.4742704      0.4617970      0.4571639 

89. Analogy by additive compositionality

Read the word meaning vectors obtained in 85, compute vec("Spain") - vec("Madrid") + vec("Athens"), and output the 10 words most similar to that vector, together with their similarities.
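The additive-compositionality idea is that directions in the vector space encode relations, so country - capital + other_capital should land near the other country. A toy illustration with made-up two-dimensional vectors:

vecs <- list(Spain = c(1.0, 0.1), Madrid = c(0.9, 0.2), Athens = c(0.2, 0.9))
vecs$Spain - vecs$Madrid + vecs$Athens
# [1] 0.3 0.8   (close to where a hypothetical "Greece" vector would sit)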

# Apply arithmetic to the meaning vectors
# Only the "+" and "-" operators are supported (note that anything else is silently ignored)
createArithmeticWordVector <- function (
  word_sence, def_arithmetic
) {
  return(
    colSums(
      do.call(
        what = "rbind", 
        args = lapply(
          X = names(def_arithmetic),
          FUN = function (each_arithmetic) {
            return(
              switch(EXPR = as.character(def_arithmetic[each_arithmetic]),
                "+" = + word_sence[each_arithmetic, ],
                "-" = - word_sence[each_arithmetic, ]
              )
            )
          }
        )
      )
    )
  )
}

TASK_EXTRACT_NUM <- 10
SET_DEF_ARITHMETIC <- c("Spain" = "+", "Madrid" = "-", "Athens" = "+")
SET_SPLIT_SIZE <- as.integer(nrow(word_sence) / 5000)


# Generate a temporary name from the operands and use it to narrow down the computed results
create_arithmtic_word_name <- stringr::str_c(names(SET_DEF_ARITHMETIC), collapse = "_")
fetch_arithmetic_cs <- fetchCosineSimilarity(
  seed_word_vector = matrix(
    data = createArithmeticWordVector(
      word_sence = word_sence, def_arithmetic = SET_DEF_ARITHMETIC
    ),
    nrow = 1, ncol = ncol(word_sence),
    dimnames = list(create_arithmtic_word_name, NULL)
  ),
  target_words_sence = word_sence,
  seed_word_name = create_arithmtic_word_name,
  split_size = SET_SPLIT_SIZE
)

# Show the top 10 words
sort(fetch_arithmetic_cs, decreasing = TRUE)[seq(from = 1, to = TASK_EXTRACT_NUM)]
        Spain       Austria   Netherlands       Belgium      Brussels 
    0.9176357     0.9098281     0.9011710     0.8996146     0.8956092 
      Antwerp Télévisions        Turkey        France       Hungary 
    0.8954350     0.8952880     0.8932235     0.8909760     0.8909016 


Closing thoughts

showClass("sparseMatrix")
Virtual Class "sparseMatrix" [package "Matrix"]

Slots:
                        
Name:       Dim Dimnames
Class:  integer     list

Extends: 
Class "Matrix", directly
Class "mMatrix", by class "Matrix", distance 2

Known Subclasses: 
Class "diagonalMatrix", directly
Class "TsparseMatrix", directly
Class "CsparseMatrix", directly
Class "RsparseMatrix", directly
Class "dsparseMatrix", directly
Class "lsparseMatrix", directly
Class "nsparseMatrix", directly
Class "indMatrix", directly
Class "ddiMatrix", by class "diagonalMatrix", distance 2
Class "ldiMatrix", by class "diagonalMatrix", distance 2
Class "dgTMatrix", by class "TsparseMatrix", distance 2
Class "dtTMatrix", by class "TsparseMatrix", distance 2
Class "dsTMatrix", by class "TsparseMatrix", distance 2
Class "lgTMatrix", by class "TsparseMatrix", distance 2
Class "ltTMatrix", by class "TsparseMatrix", distance 2
Class "lsTMatrix", by class "TsparseMatrix", distance 2
Class "ngTMatrix", by class "TsparseMatrix", distance 2
Class "ntTMatrix", by class "TsparseMatrix", distance 2
Class "nsTMatrix", by class "TsparseMatrix", distance 2
Class "dgCMatrix", by class "CsparseMatrix", distance 2
Class "dtCMatrix", by class "CsparseMatrix", distance 2
Class "dsCMatrix", by class "CsparseMatrix", distance 2
Class "lgCMatrix", by class "CsparseMatrix", distance 2
Class "ltCMatrix", by class "CsparseMatrix", distance 2
Class "lsCMatrix", by class "CsparseMatrix", distance 2
Class "ngCMatrix", by class "CsparseMatrix", distance 2
Class "ntCMatrix", by class "CsparseMatrix", distance 2
Class "nsCMatrix", by class "CsparseMatrix", distance 2
Class "dgRMatrix", by class "RsparseMatrix", distance 2
Class "dtRMatrix", by class "RsparseMatrix", distance 2
Class "dsRMatrix", by class "RsparseMatrix", distance 2
Class "lgRMatrix", by class "RsparseMatrix", distance 2
Class "ltRMatrix", by class "RsparseMatrix", distance 2
Class "lsRMatrix", by class "RsparseMatrix", distance 2
Class "ngRMatrix", by class "RsparseMatrix", distance 2
Class "ntRMatrix", by class "RsparseMatrix", distance 2
Class "nsRMatrix", by class "RsparseMatrix", distance 2
Class "dgTMatrix", by class "dsparseMatrix", distance 2
Class "dtTMatrix", by class "dsparseMatrix", distance 2
Class "dsTMatrix", by class "dsparseMatrix", distance 2
Class "dgCMatrix", by class "dsparseMatrix", distance 2
Class "dtCMatrix", by class "dsparseMatrix", distance 2
Class "dsCMatrix", by class "dsparseMatrix", distance 2
Class "dgRMatrix", by class "dsparseMatrix", distance 2
Class "dtRMatrix", by class "dsparseMatrix", distance 2
Class "dsRMatrix", by class "dsparseMatrix", distance 2
Class "lgTMatrix", by class "lsparseMatrix", distance 2
Class "ltTMatrix", by class "lsparseMatrix", distance 2
Class "lsTMatrix", by class "lsparseMatrix", distance 2
Class "lgCMatrix", by class "lsparseMatrix", distance 2
Class "ltCMatrix", by class "lsparseMatrix", distance 2
Class "lsCMatrix", by class "lsparseMatrix", distance 2
Class "lgRMatrix", by class "lsparseMatrix", distance 2
Class "ltRMatrix", by class "lsparseMatrix", distance 2
Class "lsRMatrix", by class "lsparseMatrix", distance 2
Class "ngTMatrix", by class "nsparseMatrix", distance 2
Class "ntTMatrix", by class "nsparseMatrix", distance 2
Class "nsTMatrix", by class "nsparseMatrix", distance 2
Class "ngCMatrix", by class "nsparseMatrix", distance 2
Class "ntCMatrix", by class "nsparseMatrix", distance 2
Class "nsCMatrix", by class "nsparseMatrix", distance 2
Class "ngRMatrix", by class "nsparseMatrix", distance 2
Class "ntRMatrix", by class "nsparseMatrix", distance 2
Class "nsRMatrix", by class "nsparseMatrix", distance 2
Class "pMatrix", by class "indMatrix", distance 2
# Notes from installing RedSVD via Homebrew and trying to use RedSVD from R (not run)
# https://github.com/ntessore/homebrew-nt
brew tap ntessore/homebrew-nt
brew install redsvd

# Fails its tests
# devtools::install_github("xiangze/RRedsvd")


Execution environment

library(devtools)
devtools::session_info()
Session info --------------------------------------------------------------
 setting  value                       
 version  R version 3.2.1 (2015-06-18)
 system   x86_64, darwin13.4.0        
 ui       X11                         
 language (EN)                        
 collate  ja_JP.UTF-8                 
 tz       Asia/Tokyo                  
Packages ------------------------------------------------------------------
 package     * version     date      
 assertthat  * 0.1         2013-12-06
 codetools     0.2-11      2015-03-10
 colorspace    1.2-6       2015-03-11
 countrycode * 0.18        2014-12-29
 crayon        1.3.0       2015-06-05
 curl          0.9         2015-06-19
 DBI           0.3.1       2014-09-24
 devtools    * 1.8.0       2015-05-09
 digest        0.6.8       2014-12-31
 doParallel    1.0.8       2014-02-28
 doRNG         1.6         2014-03-07
 dplyr       * 0.4.2.9002  2015-07-25
 evaluate      0.7         2015-04-21
 foreach       1.4.2       2014-04-11
 formatR       1.2         2015-04-21
 ggplot2     * 1.0.1       2015-03-17
 git2r         0.10.1      2015-05-07
 gtable        0.1.2       2012-12-05
 hadleyverse * 0.1         2015-08-09
 haven       * 0.2.0       2015-04-09
 htmltools     0.2.6       2014-09-08
 inline      * 0.3.14      2015-04-13
 irlba       * 1.0.3       2014-01-25
 iterators     1.0.7       2014-04-11
 knitr       * 1.10.5      2015-05-06
 lattice       0.20-31     2015-03-30
 lazyeval    * 0.1.10.9000 2015-07-25
 lubridate   * 1.3.3       2013-12-31
 magrittr      1.5         2014-11-22
 MASS          7.3-41      2015-06-18
 Matrix      * 1.2-1       2015-06-01
 memoise       0.2.1       2014-04-22
 munsell       0.4.2       2013-07-11
 pforeach    * 1.3         2015-07-25
 pkgmaker      0.22        2014-05-14
 plyr        * 1.8.3       2015-06-12
 proto         0.3-10      2012-12-22
 R6            2.0.1       2014-10-29
 Rcpp        * 0.12.0      2015-07-26
 readr       * 0.1.1.9000  2015-07-25
 readxl      * 0.1.0       2015-04-14
 registry      0.2         2012-01-24
 reshape2      1.4.1       2014-12-06
 rmarkdown     0.7         2015-06-13
 rngtools      1.2.4       2014-03-06
 rversions     1.0.1       2015-06-06
 scales        0.2.5       2015-06-12
 stringi     * 0.5-5       2015-06-29
 stringr     * 1.0.0.9000  2015-07-25
 testthat    * 0.10.0      2015-05-22
 tidyr       * 0.2.0.9000  2015-07-25
 xml2        * 0.1.1       2015-06-02
 xtable        1.7-4       2014-09-12
 yaml          2.1.13      2014-06-12
 source                               
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.1)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hadley/dplyr@75e8303)        
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (aaboyles/hadleyverse@16532fe)
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hadley/lazyeval@ecb8dc0)     
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hoxo-m/pforeach@2c44f3b)     
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (RcppCore/Rcpp@6ae91cc)       
 Github (hadley/readr@f4a3956)        
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hadley/stringr@380c88f)      
 CRAN (R 3.2.0)                       
 Github (hadley/tidyr@0dc87b2)        
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)