前回までのrPubsページ
- 第1章:準備運動
- 第2章:UNIXコマンドの基礎
- 第3章:正規表現
- 第4章:形態素解析
- 第5章:構文解析
- 第6章:英語テキストの処理
- 第7章:データベース
- 第8章:機械学習

前回引き続き、言語処理100本ノック（2015年版）を解きます。

（下記の『前書き（言語処理100本ノックについて）』は前回と同じです）

概要

前書き（言語処理100本ノックについて）
- 本稿では、東北大学の乾・岡崎研究室で公開されている言語処理100本ノック（2015年版）を、R言語で解いていきます。
- 改訂前の言語処理100本ノックも同様に上記研究室のサイトにあります。

上記のふたつをご覧いただき、下記に進んでいただけますと幸いです。

前書き（Rに関して）
- Rの構文や関数についての説明は一切ありませんので、あらかじめご了承ください。
- 本稿では、{base}にある文字列処理ではなく、{stringr}（1.0.0以上）とパイプ処理を極力用いております（{stringi}も処理に応じて活用していきます）。課題によってはパイプ処理でこなすのに向かない状況もありますので、あらかじめご了承ください。
- 今回は上記に加え、{pforeach}を用いて入力や集計を並列処理し、{Matrix}で疎行列を処理していきます。

参考ページ

{stringr}と{stringi}
　hadley/stringr
　RPubs - このパッケージがすごい2014: stringr
　stringiで輝く☆テキストショリスト
　 stringr 1.0.0を使ってみる
{readr}
　hadley/readr
　readr とは？
　readr 0.0.0.9000を使ってみる
{iterators}と{foreach}
　Using The iterators Package
　Writing Custom Iterators
　Using The foreach Package
　Looping and iterators, part I
　foreachパッケージで並列化する時、現在の“環境”にない変数・関数は、明示的に.export引数にて指定しなければならない
　 forを捨てよ、foreachを書こう
{pforeach}
　hoxo-m/pforeach
　Rで超簡単に並列処理を書けるパッケージpforeach を作った
　 Rで超簡単に並列処理を書けるパッケージpforeach
{Matrix}
　Introduction to the Matrix package
　Sparse Model Matrices
　疎な行列をRで扱う -R tips-
　Rで疎な多次元配列を扱う（Matrix,slamパッケージ）
　Rによる文書分類入門

ご意見やご指摘など
- こうした方が良いやこういう便利な関数がある、間違いがあるなど、ご指摘をお待ちしております。
- 下記のいずれかでご連絡・ご報告いただけますと励みになります（なお、Gitに慣れていない人です）。
　Twitter, GitHub

Rコード

以下、ひたすら解いていきます。

パッケージ読み込み

# devtools::install_github("aaboyles/hadleyverse")
SET_LOAD_LIB <- c("knitr", "hadleyverse", "stringi", "lazyeval", "pforeach", "countrycode", "Matrix", "irlba", "Rcpp", "inline")
sapply(X = SET_LOAD_LIB, FUN = library, character.only = TRUE, logical.return = TRUE)

##       knitr hadleyverse     stringi    lazyeval    pforeach countrycode 
##        TRUE        TRUE        TRUE        TRUE        TRUE        TRUE 
##      Matrix       irlba        Rcpp      inline 
##        TRUE        TRUE        TRUE        TRUE

load_packages <- as.character(na.omit(object = stringr::str_match(string = search(), pattern = "^package:(.*)")[, 2]))
knitr::opts_chunk$set(comment = NA)

事前準備

# 第9章の入力データURL（固定）
# TASK_INPUT_URL <- "http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r10-105752.txt.bz2"
TASK_INPUT_URL <- "http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2"


# ファイル取得 
download.file(
  url = TASK_INPUT_URL, destfile = basename(TASK_INPUT_URL), 
  method = "wget", quiet = FALSE
)
if (!file.exists(file =  basename(TASK_INPUT_URL))) {
  stop("File not found.")   
}

80. コーパスの整形

文を単語列に変換する最も単純な方法は，空白文字で単語に区切ることである．ただ，この方法では文末のピリオドや括弧などの記号が単語に含まれてしまう．そこで，コーパスの各行のテキストを空白文字でトークンのリストに分割した後，各トークンに以下の処理を施し，単語から記号を除去せよ．
- トークンの先頭と末尾に出現する次の文字を削除: .,!?;:()[]'"
- 空文字列となったトークンは削除
以上の処理を適用した後，トークンをスペースで連結してファイルに保存せよ．

tokenizeSentence <- function (
  str, pattern_list
) {

  return(
    dplyr::data_frame(
      text =  stringr::str_replace_all(
        string = stringr::str_split(string = str, pattern = pattern_list$SPLIT_SEP) %>%
          dplyr::combine(), 
        pattern = stringr::str_c(
          c(
            stringr::str_c("^", pattern_list$DELETE_CHAR),
            stringr::str_c(pattern_list$DELETE_CHAR, "$")
          ),
          collapse = "|"
        ),
        replacement = ""
      )
    ) %>%
      dplyr::filter(.$text != pattern_list$IS_NULL) %>%
      dplyr::combine(.) %>%
      unlist %>%
      stringr::str_c(collapse = pattern_list$JOIN_SEP) %>%
      dplyr::data_frame(text = .)
  )
}


SET_PARALLE <- list(IS_PARALLEL = TRUE, CORE = 3, ITERATORS_CHUNK_SIZE = 5000)
SET_CHAR_PATTERN <- list(
  SPLIT_SEP = "[:space:]",
  DELETE_CHAR = c("\\.", "\\,", "\\!", "\\?", "\\;", "\\:", "\\(", "\\)", "\\[", "\\]", "\\'", "\""),
  IS_NULL = "",
  JOIN_SEP = " "
)


input_con <- file(description = basename(TASK_INPUT_URL), open = "r")
iter_file <- iterators::ireadLines(
  con = input_con,
  n = SET_PARALLE$ITERATORS_CHUNK_SIZE
)
formatted_text <- pforeach::pforeach(
  read_df = iter_file,
  .c = rbind,
  .export = c("tokenizeSentence", "SET_CHAR_PATTERN"),
  .packages = load_packages,
  .parallel = SET_PARALLE$IS_PARALLEL, .cores = SET_PARALLE$CORE
)({
  data.frame(text = read_df, stringsAsFactors = FALSE) %>%
    dplyr::rowwise(.) %>%
    dplyr::do(., tokenizeSentence(str = .$text, pattern_list = SET_CHAR_PATTERN))
}) %>%
  print

Source: local data frame [252,694 x 1]
Groups: <by row>

                                                                            text
1                                                                      Anarchism
2    Anarchism is a political philosophy that advocates stateless societies ofte
3    As a subtle and anti-dogmatic philosophy anarchism draws on many currents o
4    The central tendency of anarchism as a social movement has been represented
5                                                      Etymology and terminology
6    The term anarchism is a compound word composed from the word anarchy and th
7                                                                        History
8                                                                        Origins
9    The earliest anarchist themes can be found in the 6th century BC among the 
10 The French renaissance political philosopher Étienne de La Boétie wrote in 
..                                                                           ...

close(input_con)

81. 複合語からなる国名への対処

英語では，複数の語の連接が意味を成すことがある．例えば，アメリカ合衆国は“United States”，イギリスは“United Kingdom”と表現されるが，“United”や“States”，“Kingdom”という単語だけでは，指し示している概念・実体が曖昧である．そこで，コーパス中に含まれる複合語を認識し，複合語を1語として扱うことで，複合語の意味を推定したい．しかしながら，複合語を正確に認定するのは大変むずかしいので，ここでは複合語からなる国名を認定したい．
インターネット上から国名リストを各自で入手し，80のコーパス中に出現する複合語の国名に関して，スペースをアンダーバーに置換せよ．例えば，“United States”は“United_States”，“Isle of Man”は“Isle_of_Man”になるはずである．

# 国名リストは{countrycode}のデータを使う
coutry_name <- stringr::str_replace(
  string = countrycode_data$country.name, pattern = "\\(.*\\)", replacement = ""
)
replace_coutry_name <- stringr::str_replace_all(
  string = coutry_name, pattern = " ", replacement = "_"
)
names(replace_coutry_name) <- coutry_name

# 国名を含む文だけ処理する
is_include_coutry <- sapply(
  X = lapply(X = formatted_text$text, FUN = stringr::str_detect, pattern = coutry_name),
  FUN = any
)
formatted_text$text[is_include_coutry] <- stringr::str_replace_all(
  str = formatted_text$text[is_include_coutry], pattern = replace_coutry_name
)


# ある程度の文字列長の国名を表示
formatted_text$text[is_include_coutry][
  stringr::str_length(string = formatted_text$text[is_include_coutry]) < 15
]

  [1] "India"           "Germany"         "Bavaria"        
  [4] "Japanese"        "Ancient Greece"  "Canada"         
  [7] "New_Zealand"     "Pakistan"        "India"          
 [10] "United_States"   "Goa and India"   "Japan"          
 [13] "Hungary"         "Andorra"         "Armenia"        
 [16] "Austria"         "Azerbaijan"      "Belarus"        
 [19] "Belgium"         "Bulgaria"        "Canada"         
 [22] "Croatia"         "Czech_Republic"  "Denmark"        
 [25] "Finland"         "France"          "Georgia"        
 [28] "Germany"         "Greece"          "Hungary"        
 [31] "Iceland"         "Ireland"         "Italy"          
 [34] "Kazakhstan"      "Kyrgyzstan"      "Spain"          
 [37] "Sweden"          "United_Kingdom"  "Canada"         
 [40] "United_States"   "Brazil"          "Chile"          
 [43] "France"          "La Malinche"     "New_Caledonia"  
 [46] "Canada"          "United_Kingdom"  "United_States"  
 [49] "United_Kingdom"  "France"          "United_States"  
 [52] "New_Zealand"     "Germany"         "Japanese"       
 [55] "Japanese"        "Seychelles"      "Antarctica"     
 [58] "Canada"          "Costa_Rica"      "France"         
 [61] "Indonesia"       "Japan"           "Malaysia"       
 [64] "Mexico"          "New_Zealand"     "Singapore"      
 [67] "United_Kingdom"  "United_States"   "Germany"        
 [70] "Sweden"          "United_States"   "Logic in India" 
 [73] "Logic in China"  "United_States"   "Nigeria"        
 [76] "Senegal"         "Somalia"         "Uganda"         
 [79] "China"           "Hong_Kong"       "Japan"          
 [82] "Korea"           "India"           "Philippines"    
 [85] "Morocco"         "Israel"          "Croatia"        
 [88] "Denmark"         "Finland"         "France"         
 [91] "Greece"          "Hungary"         "Italy"          
 [94] "Latvia"          "Malta"           "Netherlands"    
 [97] "Norway"          "Poland"          "Serbia"         
[100] "Spain"           "Sweden"          "Switzerland"    
[103] "Turkey"          "Cuba"            "Jamaica"        
[106] "Mexico"          "Panama"          "Costa_Rica"     
[109] "Nicaragua"       "Guatemala"       "Honduras"       
[112] "Australia"       "Fiji"            "New_Zealand"    
[115] "India"           "Boston Indiana"  "Peru Maine"     
[118] "Germany"         "Japan"           "Peru Vermont"   
[121] "Move to Italy"   "Tour de France"  "Reunion 1995"   
[124] "Tour de France"  "King of Jordan"  "Namur Belgium"  
[127] "Kosovo War"      "Iraq"            "Israel"         
[130] "China"           "Australia"       "United_Kingdom" 
[133] "United_States"   "Palau Islands"   "China"          
[136] "Japan"           "Korea"           "Mongolia"       
[139] "Singapore"       "United_States"   "Spain"          
[142] "Greece"          "India"           "China"          
[145] "Mexico"          "New France"      "United_States"  
[148] "Brazil"          "India"           "Bangladesh"     
[151] "Pakistan"        "Nepal"           "Sri_Lanka"      
[154] "Afghanistan"     "Turkey"          "Azerbaijan"     
[157] "Tajikistan"      "Egypt"           "Morocco"        
[160] "Serbia"          "Reunion"         "Australia"      
[163] "New_Zealand"     "South_Africa"    "China"          
[166] "Bochum Germany"  "Japanese tour"   "South_Africa"   
[169] "Hong_Kong"       "Singapore"       "United_States"  
[172] "Canada"          "United_Kingdom"  "Germany"        
[175] "India"           "Armenia"         "Latvia"         
[178] "Lithuania"       "Slovakia"        "Spain"          
[181] "Mexico"          "Argentina"       "Colombia"       
[184] "Peru"            "Germany"         "United_Kingdom" 
[187] "United_States"   "France"          "Italy"          
[190] "China"           "Japan"           "War with Spain" 
[193] "United_States"   "United_States"   "Canada"         
[196] "Chile"           "Honduras"        "Mexico"         
[199] "Brazil"          "Austria"         "France"         
[202] "Ireland"         "Italy"           "Netherlands"    
[205] "United_Kingdom"  "Japan"           "Israel"         
[208] "Australia"       "New_Zealand"     "United_States"  
[211] "Cyprus"          "Egypt"           "Hungary"        
[214] "Morocco"         "India"           "United_States"  
[217] "Japanese"        "Avro Canada"     "Mali"           
[220] "Kenya"           "Gabon"           "Botswana"       
[223] "Burkina_Faso"    "Egypt"           "Kosovo"         
[226] "Montenegro"      "Greece"          "Canada"         
[229] "France"          "Germany"         "Italy"          
[232] "Malaysia"        "New_Zealand"     "Philippines"    
[235] "Poland"          "Portugal"        "Spain"          
[238] "Sri_Lanka"       "Thailand"        "Tunisia"        
[241] "United_States"   "Canada"          "Cuba"           
[244] "Haiti"           "Canada"          "Mexico"         
[247] "United_States"   "Canada"          "Australia"      
[250] "United_States"   "United_Kingdom"  "United_States"  
[253] "United_States"   "Jersey Numbers"  "United_Kingdom" 
[256] "India"           "Ireland"         "Puerto_Rico"    
[259] "Northern India"  "Southern India"  "Western India"  
[262] "Eastern India"   "Spain"           "China"          
[265] "Iraq War"        "United_Kingdom"  "United_States"  
[268] "HMS Fiji 58"     "Korean War"      "Lithuanians"    
[271] "1920 in France"  "Malta"           "Korean War"     
[274] "Australia"       "Canada"          "France"         
[277] "Germany"         "Japan"           "Netherlands"    
[280] "New_Zealand"     "Rwanda"          "Switzerland"    
[283] "United_Kingdom"  "United_States"   "Ireland"        
[286] "United_States"   "Australia"       "Denmark"        
[289] "South China"     "South China"     "Ottoman Greece" 
[292] "New_Zealand"     "South_Africa"    "France 3"       
[295] "Korean War"      "Indonesia"       "Egypt"          
[298] "Reunion"         "United_States"   "Canada"         
[301] "Germany"         "Monaco"          "Zimbabwe"       
[304] "United_States"   "Canada"          "Thailand"       
[307] "United_Kingdom"  "United_States"   "United_Kingdom" 
[310] "United_States"   "Canada"          "Mexico"         
[313] "Sweden"          "Ireland"         "Italy"          
[316] "Malta"           "Mexico"          "Australia"      
[319] "Australia"       "Belgium"         "Canada"         
[322] "Chile"           "China"           "Ecuador"        
[325] "France"          "Japan"           "Netherlands"    
[328] "New_Zealand"     "Poland"          "South_Africa"   
[331] "Turkey"          "United_Kingdom"  "United_States"  
[334] "Zimbabwe"        "Kosovo–Resava"   "United_States"  
[337] "Canada"          "In India"        "Korean War"     
[340] "South_Africa"    "Argentina"       "Australia"      
[343] "Belgium"         "Brazil"          "Italy"          
[346] "Israel"          "New_Zealand"     "Pakistan"       
[349] "Portugal"        "Philippines"     "Qatar"          
[352] "United_Kingdom"  "United_States"   "Thailand"       
[355] "Turkey"          "United_States"   "United_Kingdom" 
[358] "Japan"           "Australia"       "India"          
[361] "Israel"          "Malaysia"        "North Korea"    
[364] "United_Kingdom"  "United_States"   "Puerto_Rico"    
[367] "United_Kingdom"  "Netherlands"     "France"         
[370] "Reunion\""       "United_States"   "United_Kingdom" 
[373] "Japan only"      "Japan only"      "Japan"          
[376] "Little Italy"    "Australia"       "Austria"        
[379] "Hungary"         "North India"     "United_States"  
[382] "United_Kingdom"  "Belgium"         "New_Zealand"    
[385] "Ralph Jordan"    "Armenian"        "Indiana"        
[388] "Duke of Parma"   "Niger in 1974"   "United_States"  
[391] "Australia"       "Doug Ireland"    "Indonesia"      
[394] "Philippines"     "Kabri Israel"    "Kylie Ireland"  
[397] "Francesinha"     "Ireland"         "Romania"        
[400] "Korean War"      "United_States"   "Poland"         
[403] "Germany"         "Turkey"          "India"          
[406] "Japan"           "Australia"       "Argentina"      
[409] "South_Africa"    "Canada"          "Chile"          
[412] "United_States"   "Canada"          "United_Kingdom" 
[415] "Australia"       "In Georgian"     "In Belarusian"  
[418] "In Croatian"     "In Romanian"     "In Estonian"    
[421] "In Bulgarian"    "Korean"          "Japanese"       
[424] "United_States"   "Beer in Israel"  "Belgium"        
[427] "Australia"       "Canada"          "United_States"  
[430] "United_Kingdom"  "Finland"         "Canada"         
[433] "Greece"          "Iceland"         "Japan"          
[436] "New_Zealand"     "United_Kingdom"  "United_States"  
[439] "France"          "Germany"         "Japan"          
[442] "United_Kingdom"  "United_States"   "Croatia"        
[445] "Cyprus"          "Czech_Republic"  "France"         
[448] "Germany"         "Greece"          "Hungary"        
[451] "Ireland"         "Italy"           "Malta"          
[454] "Poland"          "Portugal"        "Serbia"         
[457] "Spain"           "Turkey"          "Ukraine"        
[460] "United_Kingdom"  "Canada"          "United_States"  
[463] "Mexico"          "South_Africa"    "Tunisia"        
[466] "Fiji"            "Australia"       "New_Zealand"    
[469] "Antarctica"      "Zimbabwe"        "Ireland"        
[472] "Australian CD"   "Air Jordans"     "Israel"         
[475] "ED250 Poland)"   "Guatemala Clan"  "Australia"      
[478] "Canada"          "Ireland"         "Japan"          
[481] "Malaysia"        "New_Zealand"     "Norway"         
[484] "Slovakia"        "South_Africa"    "Spain"          
[487] "United_Kingdom"  "United_States"   "United_States"  
[490] "United_States"   "In Albania"      "Funke v France" 
[493] "Georgia"         "Indiana"         "New Jersey"     
[496] "New Mexico"      "India"           "Japan"          
[499] "Lebanon"         "Philippines"     "Austria"        
[502] "Czech_Republic"  "Denmark"         "Finland"        
[505] "France"          "Germany"         "Iceland"        
[508] "Ireland"         "Italy"           "Netherlands"    
[511] "Norway"          "Poland"          "Portugal"       
[514] "Romania"         "Slovakia"        "Slovenia"       
[517] "Spain"           "Sweden"          "United_Kingdom" 
[520] "Canada"          "Honduras"        "Mexico"         
[523] "United_States"   "Australia"       "New_Zealand"    
[526] "Brazil"          "Peru"            "Media of Sudan" 
[529] "China"           "Japan"           "Korea"          
[532] "Singapore"       "United_States"   "Australia"      
[535] "Armenian"        "Egypt"           "Perugia"        
[538] "Perugia"         "Bank of Canada"  "O Canada"       
[541] "Guam"            "Japan"           "Korean War"     
[544] "In Zimbabwe"     "Zambia"          "Spain"          
[547] "Greenland"       "Indian College"  "Sweden"         
[550] "In Israel"       "By Israelis"     "Indian Agent"   
[553] "Paraguayan War"  "France"          "Mongolian name" 
[556] "Mongolia"        "Inner Mongolia"  "Ukraine"        
[559] "Estonia"         "Czech_Republic"  "In Korea"       
[562] "Korean War"      "Afghanistan"     "Iraq"           
[565] "Oia Greece"      "Ilija Slovakia"  "Cuban art"      
[568] "Iceland"         "Greenland"       "Norway"         
[571] "Sweden"          "Denmark"         "Korean War"     
[574] "Morocco"         "East Pakistan"   "Emma of Italy"  
[577] "Japanese ska"    "Indian squad"    "Italy"          
[580] "New Guinea"      "Korean War"      "Arts of China"  
[583] "Argentina"       "Brazil"          "Chile"          
[586] "Colombia"        "Peru"            "SOS Israel"     
[589] "Queen of Spain"  "India 2007)"     "India 2008)"    
[592] "India 2014)"     "United_Kingdom"  "United_States"  
[595] "Haiti"           "United_States"   "North Koreans"  
[598] "South Koreans"   "Sulm Austria"    "South China"    
[601] "• Albanian"      "• Latvian"       "• Romanian"     
[604] "• Serbian"       "Italy"           "Montenegro"     
[607] "Parma Panthers"  "Malik Raja"      "Australia"      
[610] "Italy"           "United_Kingdom"  "Poland"         
[613] "Uganda"          "United_States"   "Greece"         
[616] "Yugoslavia"      "Poland"          "Czechoslovakia" 
[619] "Bulgaria"        "Romania"         "Hungary"        
[622] "Peru also is"    "vs Fiji"         "Bavarian Cup"   
[625] "Time in Ghana"   "France"          "United_States"  
[628] "Australia"       "Australia"       "Belize"         
[631] "Canada"          "France"          "New_Zealand"    
[634] "Philippines"     "United_Kingdom"  "United_States"  
[637] "New Jersey"      "Mexico U23"      "United_States"  
[640] "Mexico"          "New_Zealand"     "Colombia"       
[643] "Viet_Nam War"    "Tonga"           "In Canada"      
[646] "Aki Polandwood"  "Australia"       "Fiji"           
[649] "Kiribati"        "Nauru"           "New_Zealand"    
[652] "Palau"           "Samoa"           "Tonga"          
[655] "Tuvalu"          "Vanuatu"         "and Japanese"   
[658] "United_States"   "Spain"           "Mexico"         
[661] "Canada"          "Mexico"          "China"          
[664] "United_Kingdom"  "Indonesia"       "Thailand"       
[667] "Japan"           "Australia"       "Australia"      
[670] "United_Kingdom"  "Canada"          "Japan"          
[673] "United_States"   "Libya"           "Ireland"        
[676] "France"          "Albania"         "Czechoslovakia" 
[679] "Hong_Kong"       "Hungary"         "Indonesia"      
[682] "Japan"           "Poland"          "Romania"        
[685] "Serbia"          "South Korea"     "United_States"  
[688] "United_States"   "After Bermuda"   "Georgia"        
[691] "New Jersey"      "United_Kingdom"  "United_States"  
[694] "Switzerland"     "Wild Turkey"     "United_States"  
[697] "American_Samoa"  "Lower Austria"   "Upper Austria"  
[700] "United_States"   "Mozambique"      "Thailand"       
[703] "Gulf of Mexico"  "Australia"       "Canada"         
[706] "Greece"          "India"           "Israel"         
[709] "Mexico"          "New_Zealand"     "South_Africa"   
[712] "United_Kingdom"  "United_States"   "South_Africa"   
[715] "East Indians"    "Portugal"        "Albania"        
[718] "Austria"         "Belgium"         "Bulgaria"       
[721] "Croatia"         "Cyprus"          "Czech_Republic" 
[724] "Denmark"         "Estonia"         "Finland"        
[727] "France"          "Germany"         "Greece"         
[730] "Hungary"         "Iceland"         "Ireland"        
[733] "Italy"           "Kosovo"          "Latvia"         
[736] "Lithuania"       "Luxembourg"      "Malta"          
[739] "Monaco"          "Montenegro"      "Netherlands"    
[742] "Norway"          "Poland"          "Portugal"       
[745] "Romania"         "Serbia"          "Slovenia"       
[748] "Spain"           "Sweden"          "Switzerland"    
[751] "Gibraltar"       "Georgia State"   "Liberia"        
[754] "Reach Canada"    "New Guinea"      "Philippines"    
[757] "Hong_Kong"       "Philippines"     "Trip to Japan"  
[760] "France"          "Germany"         "Australia"      
[763] "Canada"          "Israel"          "New_Zealand"    
[766] "United_States"   "Haiti 1802"      "Brazil"         
[769] "Reunion Gummer"  "Korea"           "Hong_Kong"      
[772] "Australia"       "Bahrain"         "Canada"         
[775] "Cyprus"          "Jersey"          "Mauritius"      
[778] "South_Africa"    "United_States"   "Finland"        
[781] "Ireland"         "Israel"          "Netherlands"    
[784] "Tour de France"  "Puerto_Rico"     "Japan"          
[787] "Brazil Gloob"    "Portugal RTP2"   "Origins Canada" 
[790] "India"           "Iraq"            "Japan"          
[793] "Libya"           "United_States"   "Israel"         
[796] "North Korea"     "Canada"          "Australia"      
[799] "Caen France"     "GAC Ireland"     "Jordan Obita"   
[802] "Brad Jordan"     "Taylah Jordan"   "Israeli Druze"  
[805] "In Israel"       "Jordan Kovacs"   "New Jersey"     
[808] "Korean War"      "Norway"          "Iraq"           
[811] "Greece"          "Denmark"         "Spain"          
[814] "Yugoslavia"      "Romania"         "Sweden"         
[817] "Belgium"         "Luxembourg"      "Parma"          
[820] "Japan"           "Brazil"          "Canada"         
[823] "Australia"       "Ireland"         "Hockey Canada"  
[826] "Italy"           "France"          "Greece"         
[829] "Georgia Force"   "New Jersey"      "South_Africa"   
[832] "Korean War"      "Japan"           "United_States"  
[835] "Pakistan"        "India"           "Libyan people"  
[838] "Italy"           "Japan"           "India"          
[841] "In Australia"    "In New_Zealand"  "Canada"         
[844] "Croatia"         "France"          "India"          
[847] "Italy"           "Morocco"         "Norway"         
[850] "Peru"            "Portugal"        "Spain"          
[853] "Turkey"          "United_States"   "Frances Hugle"  
[856] "Korean War"      "Canada"          "China"          
[859] "Germany"         "India"           "Israel"         
[862] "Japan"           "Norway"          "b Ghana’s Gold"
[865] "Israel"          "France"          "United_Kingdom" 
[868] "Liechtenstein"   "Spain"           "United_States"  
[871] "Israel"          "Peru"            "Burundi"        
[874] "Haiti"           "Mexico"          "Thailand"       
[877] "Korean War"      "Malin Diaz"      "Germany - Puma" 
[880] "Turkey - Altay"  "Korean War"      "Indian IS 1293" 
[883] "Tunisia"         "Oman"            "Italy"          
[886] "Japanese"        "In Estonia"      "In Latvia"      
[889] "Visit to Italy"  "East Germany"    "Cuban Thaw"

82. 文脈の抽出

81で作成したコーパス中に出現するすべての単語tに関して，単語tと文脈語cのペアをタブ区切り形式ですべて書き出せ．ただし，文脈語の定義は次の通りとする．
- ある単語tの前後d単語を文脈語cとして抽出する（ただし，文脈語に単語tそのものは含まない）
- 単語tを選ぶ度に，文脈幅dは{1,2,3,4,5}の範囲でランダムに決める．

# word, type(前 or 後), window, context_word
fetchContextWord <- function (
  str, window_size, seed = NULL
){

  str_len <- length(str)
  if (is.null(seed)){
    set.seed(seed = seed)
  }
  context_window <- sample.int(n = window_size, size = str_len, replace = TRUE)
  word_window <- seq(from = 1, to = str_len) + context_window
  word_window[word_window > str_len] <- str_len
  
  # 単語の後側d語
  right_word <- stringr::str_split_fixed(
    string = stringr::word(
      string = stringi::stri_flatten(str = str, collapse = " "),
      start = seq(from = 1, to = str_len), end = word_window,
      sep = " "
    ),
    pattern = " ",
    n = context_window + 1
  )
  if (ncol(right_word) < window_size + 1) {
    right_word <- cbind(
      right_word,
      matrix(data = "", nrow = nrow(right_word), ncol = (window_size + 1) - ncol(right_word))
    )
  }
  
  # 単語の前側d語
  left_word <- stringr::str_split_fixed(
    string = stringr::word(
      string = stringi::stri_flatten(str = str[seq(from = str_len, to = 1)], collapse = " "),
      start = seq(from = 1, to = str_len), end = word_window,
      sep = " "
    ),
    pattern = " ",
    n = context_window + 1
  )[seq(from = str_len, to = 1), , drop = FALSE]
  if (ncol(left_word) < window_size + 1) {
    left_word <- cbind(
      left_word,
      matrix(data = "", nrow = nrow(left_word), ncol = (window_size + 1) - ncol(left_word))
    )
  }


  right_window <- left_window <- seq(from = 1, to = window_size + 1)
  names(x = right_window) <-names(x = left_window) <- c("word", seq(from = 1, to = window_size))

  return(
    dplyr::bind_rows(
      data.frame(right_word, stringsAsFactors = FALSE) %>%
        dplyr::rename_(
          .dots = setNames(
            object = stringr::str_c("X", right_window), 
            nm = names(right_window)
          )
        ) %>%
        dplyr::mutate(type = "right"),
      data.frame(left_word, stringsAsFactors = FALSE) %>%
        dplyr::rename_(
          .dots = setNames(
            object = stringr::str_c("X", left_window), 
            nm = names(left_window)
          )
        ) %>% 
        dplyr::mutate(type = "left")
    ) %>%
      tidyr::gather_(
        key_col = c("window"), 
        gather_cols = as.character(seq(from = 1, to = window_size)),
        value_col = c("context_word")
    )
  )
}

SET_SEED <- 71
SET_WINDOW_SIZE <- 5
SET_PARALLE_SPLIT_SIZE <- 5000
SET_WRITE_FILE_NAME <- "term_context.tsv"


# 今回は文の順番は関係ないタスクなので、乱数を用いて文を分割して処理
formatted_text$split <- sort(
  x = sample.int(
    n = SET_PARALLE_SPLIT_SIZE, size = nrow(formatted_text),
    replace = TRUE
  )
)
formatted_text_iter <- iterators::isplit(x = formatted_text$text, f = formatted_text$split)
word_context_word <- pforeach::pforeach(
  p_read_df = formatted_text_iter,
  .c = rbind,
  .export = c("fetchContextWord", "SET_WINDOW_SIZE", "SET_SEED", "SET_WRITE_FILE_NAME"),
  .packages = load_packages,
  .parallel = SET_PARALLE$IS_PARALLEL, .cores = SET_PARALLE$CORE,
  .inorder = FALSE
)({
  dplyr::data_frame(text = p_read_df$value) %>%
    dplyr::rowwise(.) %>%
    dplyr::do(.,
      fetchContextWord(
        str = stringr::str_split(string = .$text, pattern = "[:blank:]") %>% 
          unlist,
        window_size = SET_WINDOW_SIZE,
        seed = SET_SEED
      )
    ) %>%
    dplyr::filter(word != "") %>%
    dplyr::filter(context_word != "") %>%
    dplyr::select(word, context_word) %>%
    readr::write_tsv(path = SET_WRITE_FILE_NAME, append = TRUE)
})

word_context_word <- readr::read_tsv(
  file = SET_WRITE_FILE_NAME, n_max = -1,
  col_names = c("word", "context_word")
) %>% 
  print


|================================================================================| 100%  782 MB

Warning: 719 problems parsing 'term_context.tsv'. See problems(...) for
more details.

Source: local data frame [68,078,486 x 2]

         word context_word
1   Anarchism           is
2          is            a
3           a    political
4   political   philosophy
5  philosophy         that
6        that    advocates
7   advocates    stateless
8   stateless    societies
9   societies        often
10      often      defined
..        ...          ...

readr::problems(x = word_context_word)

Source: local data frame [719 x 4]

       row col       expected actual
1  1169592   3 Only 2 columns       
2  1171327   3 Only 2 columns       
3  1175827   3 Only 2 columns       
4  1178457   3 Only 2 columns       
5  1180217   3 Only 2 columns       
6  1182196   3 Only 2 columns       
7  1183243   3 Only 2 columns       
8  1185986   3 Only 2 columns       
9  1186978   3 Only 2 columns       
10 1188287   3 Only 2 columns       
..     ... ...            ...    ...

word_context_word <- na.omit(word_context_word)

83. 単語／文脈の頻度の計測

82の出力を利用し，以下の出現分布，および定数を求めよ．
- f(t,c): 単語tと文脈語cの共起回数
- f(t,∗): 単語tの出現回数
- f(∗,c): 文脈語cの出現回数
- N: 単語と文脈語のペアの総出現回数

# f(t,c)
co_word_pair <- word_context_word %>%
  dplyr::group_by(word, context_word) %>%
  dplyr::summarize(tc = n()) %>%
  print

Source: local data frame [21,748,822 x 3]
Groups: word

                                                                                          word
1  \this\nform\tmaterial\nto\tform\nfeed\tto\nfor\tfeed\nthe\tfor\nlast\tthe\ntime\tlast\nfrom
2                                                                                            !
3                                                                                            !
4                                                                                            !
5                                                                                            !
6                                                                                            !
7                                                                                            !
8                                                                                            !
9                                                                                            !
10                                                                                           !
..                                                                                         ...
Variables not shown: context_word (chr), tc (int)

# f(t,∗)
word_freq <- word_context_word %>%
  dplyr::ungroup() %>%
  dplyr::group_by(word) %>%
  dplyr::summarize(t = n()) %>%
  print

Source: local data frame [432,329 x 2]

                                                                                          word
1  \this\nform\tmaterial\nto\tform\nfeed\tto\nfor\tfeed\nthe\tfor\nlast\tthe\ntime\tlast\nfrom
2                                                                                            !
3                                                                                          !*@
4                                                                                           !,
5                                                                                            "
6                                                                                           ""
7                                                                                          "")
8                                                                                      ""Abid"
9                                                              ""Allerheiligen-Wasserfälle"")
10                                                                                    ""Banque
..                                                                                         ...
Variables not shown: t (int)

# f(∗,c)
context_freq <- word_context_word %>%
  dplyr::ungroup() %>%
  dplyr::group_by(context_word) %>%
  dplyr::summarize(c = n()) %>%
  print

Source: local data frame [432,863 x 2]

                      context_word    c
1                                !   25
2                              !*@    5
3                               !,    6
4                                " 1890
5                               ""   57
6                              "")   34
7                          ""Abid"    7
8  ""Allerheiligen-Wasserfälle"")    4
9                         ""Banque    8
10                      ""Dampf"")    7
..                             ...  ...

# N
nrow(co_word_pair)

[1] 21748822

84. 単語文脈行列の作成

83の出力を利用し，単語文脈行列Xを作成せよ．ただし，行列Xの各要素Xtcは次のように定義する．
- f(t,c) ≥ 10ならば，\(X_{tc}\)=PPMI(t,c)=\(max \{log\frac{N \times f(t,c)}{f(t,∗) \times f(∗,c)},0 \}\)
- f(t,c) < 10ならば，\(X_{tc}\)=0
ここで，PPMI(t,c)はPositive Pointwise Mutual Information（正の相互情報量）と呼ばれる統計量である．なお，行列Xの行数・列数は数百万オーダとなり，行列のすべての要素を主記憶上に載せることは無理なので注意すること．幸い，行列Xのほとんどの要素は0になるので，非0の要素だけを書き出せばよい．

SET_THRESHOLD <- 10


ppmi <- suppressWarnings(
  dplyr::left_join(
    x = dplyr::left_join(
      x = co_word_pair %>%
        dplyr::filter(tc >= SET_THRESHOLD),
      y = word_freq,
      by = "word"
    ),
    y = context_freq,
    by = "context_word"
  ) %>%
  dplyr::mutate(ppmi = log((nrow(co_word_pair) * tc) / (t * c)))
)

# 桁あふれによるNaN, NA化した値をゼロに変換
ppmi$ppmi <- ifelse(ppmi$ppmi > 0, ppmi$ppmi, 0)
ppmi <- ppmi %>%
  dplyr::select(word, context_word, ppmi) %>%
  replace(is.na(.), 0)


# 疎行列化（単純にtidyr::spreadをすると、メモリ消費が大きい）
occur_words <- unique(word_freq$word)
ppmi$word <- factor(ppmi$word, levels = as.character(occur_words))
ppmi$context_word <- factor(ppmi$context_word, levels = as.character(occur_words))
word_context_ppmi <- Matrix::sparseMatrix(
  i = as.integer(ppmi$word), j = as.integer(ppmi$context_word),
  x = ppmi$ppmi,
  dims = c(length(occur_words), length(occur_words)),
  dimnames = list(occur_words, occur_words)
)

85. 主成分分析による次元圧縮

84で得られた単語文脈行列に対して，主成分分析を適用し，単語の意味ベクトルを300次元に圧縮せよ．

SET_DIM_NUM <- 300


# 疎行列に対応している特異値分解(SVD)の関数を使う
# ここではさらに左・右の特異ベクトルを近似した手法を用いる
# （理解が足らないので「所感」で触れた資料や書籍で勉強し直す）
lsi_res <- irlba::irlba(
  A = word_context_ppmi,
  nu = SET_DIM_NUM, nv = SET_DIM_NUM
)
word_sence <- t(
  t(lsi_res$u[, seq(from = 1, to = SET_DIM_NUM), drop = FALSE]) %*% word_context_ppmi
)
rownames(word_sence) <- rownames(word_context_ppmi)

86. 単語ベクトルの表示

85で得た単語の意味ベクトルを読み込み，“United States”のベクトルを表示せよ．ただし，“United States”は内部的には“United_States”と表現されていることに注意せよ．

SET_SEARCH_WORD <- "United_States"


# 計算しておいた意味ベクトルから行名でマッチング
word_sence[is.element(rownames(word_sence), SET_SEARCH_WORD), ]

  [1]  2.037838e+00 -9.375344e-01 -3.106189e-01 -1.605345e-01 -5.904840e-02
  [6]  6.518050e-01  2.602280e-02  5.277793e-02 -4.068961e-01  2.610499e-01
 [11]  3.918249e-01 -4.765525e-02  6.441821e-02  3.308307e-01 -7.000656e-01
 [16] -1.136182e+00 -7.595355e-01  1.312867e+00  2.849231e-01 -5.978687e-01
 [21] -2.377491e-01  2.204588e-01 -4.045608e-01  7.395452e-01  1.753993e-01
 [26] -5.848825e-01  7.487949e-02 -8.271620e-02  3.057006e-01 -2.858766e-01
 [31] -1.626025e-14 -7.482628e-02  5.950323e-01  4.345851e-01  1.774483e-01
 [36]  3.518586e-01 -1.127549e-01  1.978918e-02 -2.124593e-02  2.632636e-01
 [41]  1.988843e-03 -1.085607e-01 -5.579423e-02 -6.353527e-01 -8.574127e-02
 [46] -2.702903e-01 -1.086915e-01 -9.561545e-02  3.506378e-01 -1.136618e-01
 [51] -4.273408e-02  2.581737e-01  4.037139e-01 -5.753142e-02 -4.511197e-02
 [56]  3.837357e-01 -3.667011e-01 -5.704303e-01  4.766412e-01 -5.312001e-02
 [61]  5.940559e-01 -9.484519e-01  8.184152e-01 -2.463707e-01  4.159756e-01
 [66]  8.324350e-01  1.780600e-01 -2.443294e-01 -2.646356e-02  1.453572e-01
 [71]  2.839435e-01  2.313725e-01  6.264211e-01 -3.446654e-01  6.386369e-01
 [76]  3.006067e-02 -3.472549e-01  5.245093e-01  6.549072e-01  2.267178e-01
 [81] -1.744061e-01  1.544763e-01 -1.885759e-01  3.261650e-01  1.818196e-01
 [86] -5.395743e-01  5.581458e-01 -5.187504e-01  1.807359e-01 -3.550431e-01
 [91] -1.276461e-02 -8.716716e-02 -1.975487e-01  1.522283e-01  3.874773e-01
 [96]  1.084189e-02  1.789052e-01 -3.046082e-02  4.413375e-02  8.527400e-02
[101]  1.860033e-01 -1.483027e-02  8.239071e-01  2.443411e-01 -1.125405e-01
[106] -7.860227e-02  2.509777e-01  4.514155e-03 -4.874264e-01  4.044995e-02
[111]  2.337708e-01 -4.252291e-01 -8.265042e-01 -6.330464e-01 -1.257250e-01
[116] -1.402304e-01  5.298778e-01  8.078769e-02  1.996961e-01 -1.518596e-01
[121]  4.562999e-01 -5.120750e-01  1.661834e-01 -3.112895e-02  1.371840e-01
[126]  2.215744e-01 -4.915658e-01 -5.562550e-02  4.136874e-01  2.518635e-01
[131] -1.942698e-01  4.185125e-01  2.838158e-01 -1.023195e+00  2.331462e-01
[136]  3.405410e-01  2.619637e-01  1.839301e-01  4.260956e-03 -7.053745e-02
[141] -3.222604e-02  1.703391e-01 -1.908760e-01 -7.297733e-01  3.924142e-01
[146]  2.141091e-01 -8.826797e-02 -1.787676e-12 -3.463508e-01 -2.527450e-01
[151]  1.264967e-01  2.092116e-01 -1.912777e-01  2.862594e-01 -1.797915e-01
[156]  3.921730e-03 -1.882014e-01  3.413298e-01  7.014191e-02 -4.041935e-02
[161]  3.021720e-01 -5.929334e-02 -1.431572e-01 -4.116285e-01  2.270619e-01
[166] -1.322969e-01 -4.940011e-01 -3.803055e-01 -1.353691e-01 -5.702957e-01
[171] -1.425753e-01  2.804969e-01  9.874140e-02 -5.764361e-01 -1.951386e-01
[176]  3.810475e-01 -1.868904e-01  3.822573e-01 -6.042654e-02 -2.721631e-01
[181]  3.634367e-01  2.552693e-01 -2.499335e-02  1.852346e-01  6.149274e-01
[186] -2.711030e-01 -1.853175e-01 -1.172159e-01 -3.391866e-01 -3.626279e-02
[191]  5.045489e-01 -2.980203e-01  2.137921e-01 -4.796995e-01  2.405535e-01
[196] -9.349896e-01  5.770628e-02 -4.376850e-01 -3.033761e-01 -3.721689e-01
[201] -4.449024e-01 -4.249434e-01 -2.055324e-02  2.367674e-01  5.078755e-01
[206]  2.513147e-01  1.643741e-01 -1.901829e-01  5.284053e-03 -2.717752e-01
[211] -1.348662e-01  1.102604e-01 -1.819928e-01  3.032791e-01  8.319814e-02
[216] -1.158638e-01 -2.505669e-01 -2.838608e-03 -1.407063e-01  4.587667e-01
[221]  4.361180e-01 -3.722908e-01 -6.295600e-02 -4.343454e-02 -6.277148e-02
[226] -1.120107e-01 -1.999548e-01 -2.116093e-01 -7.791790e-01 -5.385965e-02
[231] -2.015585e-01 -2.973695e-01 -9.012819e-02  7.317029e-02  5.602437e-01
[236] -3.891752e-02 -7.414676e-01  3.921536e-02  2.230800e-01 -5.484260e-01
[241]  9.380090e-02  3.474563e-01  4.400047e-03 -8.169646e-01 -6.850005e-02
[246] -1.643937e-01 -2.460247e-12 -4.393261e-02 -5.913047e-01 -4.294057e-02
[251] -2.508756e-02  5.760611e-02 -2.234026e-01 -3.073265e-02 -3.955376e-02
[256] -1.963136e-01 -3.460861e-01  8.795343e-02  4.492767e-01  3.553542e-01
[261] -7.870723e-01 -4.611821e-01  1.242627e-01  1.111002e-02  4.436061e-01
[266]  5.281246e-01  2.755074e-01 -4.890112e-01  1.931703e-01 -2.591270e-02
[271] -2.714101e-01 -3.758363e-01  5.954379e-01  1.409803e-01  6.588941e-01
[276]  1.283339e-01  1.405396e-01 -2.381568e-01 -3.367582e-01 -1.644492e-01
[281]  4.049569e-02  2.608836e-01 -3.598677e-01  8.669979e-01 -1.865976e-01
[286]  3.423840e-01 -1.029759e-01  2.729978e-01  3.205047e-01 -2.795269e-01
[291] -1.308581e-01  6.806350e-02  1.276574e-01  9.465147e-02 -1.063264e-01
[296]  8.932431e-02 -2.399941e-01  3.492260e-01 -3.412677e-01 -3.299891e-01

87. 単語の類似度

85で得た単語の意味ベクトルを読み込み，“United States”と“U.S.”のコサイン類似度を計算せよ．ただし，“U.S.”は内部的に“U.S”と表現されていることに注意せよ．

# コサイン類似度を{Rcpp}で書く
# 今回は使わない
rcpp_cosine_sim <- '
  NumericVector calcCosineSim(NumericMatrix ipt_mat) {
    using namespace Rcpp;
    Environment base("package:base");
    Function Rcrossprod = base["crossprod"];
    Function Rtranspose = base["t"];
    Function Router = base["outer"];
    
    NumericMatrix numerator = Rcrossprod(Rtranspose(ipt_mat));
    NumericVector denominator = diag(numerator);
    return Rcpp::wrap(numerator / sqrt(Router(denominator, denominator)));
  }
'
calcCosine <- Rcpp::cppFunction(code = rcpp_cosine_sim, plugins = "cpp11")

# コサイン類似度の必要な行だけを取り出す（N * Nの行列の出力を避ける）
filterCosineSim <- function (
  seed_word_vector, target_word_vectors, 
  extract_rownames = NULL
) {
  word_vectors <- rbind(seed_word_vector, target_word_vectors)
  numerator <- crossprod(x = t(x = word_vectors))
  denominator <- diag(numerator)
  return((numerator / sqrt(outer(denominator, denominator)))[extract_rownames, ])
}

SET_COMPARE_WORDS <- c("United_States", "^U\\.S$")


filterCosineSim(
  seed_word_vector = word_sence[
    is.element(rownames(word_sence), SET_COMPARE_WORDS[1]), , drop = FALSE
  ],
  target_word_vectors = word_sence[
    stringr::str_detect(
      string = rownames(word_sence), pattern = SET_COMPARE_WORDS[-1]
    ), , drop = FALSE
  ], 
  extract_rownames = SET_COMPARE_WORDS[1]
)

United_States           U.S 
    1.0000000     0.7838752

88. 類似度の高い単語10件

85で得た単語の意味ベクトルを読み込み，“England”とコサイン類似度が高い10語と，その類似度を出力せよ．

# 適当な乱数を割り当てて、一度にコサイン類似度を求めるベクトル数を減らす
fetchCosineSimilarity <- function(
  seed_word_vector, target_words_sence, 
  seed_word_name,
  split_size
){
  
  # 疎行列から行列へ変換
  seed_word_vector <- t(apply(X = seed_word_vector, MARGIN = 1, FUN = as.matrix))
  target_words_sence <- data.frame(
    t(apply(X = target_words_sence, MARGIN = 1, FUN = as.matrix)), 
    stringsAsFactors = FALSE
  )
  
  target_words_sence$split <- sample.int(
    n = split_size, size = nrow(target_words_sence),
    replace = TRUE
  )
  fetch_cs <- lapply(
    X = split(
      x = target_words_sence[, !is.element(colnames(target_words_sence), "split")],
      f = target_words_sence$split
    ),
    FUN = function (target_sence) {
      cosine_sim_res <- filterCosineSim(
        seed_word_vector = seed_word_vector,
        target_word_vectors = as.matrix(target_sence), 
        extract_rownames = seed_word_name
      )[-1]
      return(cosine_sim_res[!is.nan(x = cosine_sim_res)])
    }
  )
  cs_names <- dplyr::combine(sapply(X = fetch_cs, FUN = names))
  fetch_cs <- dplyr::combine(fetch_cs)
  names(fetch_cs) <- cs_names

  return(fetch_cs)
}

TASK_EXTRACT_NUM <- 10
SET_SEED_WORD <- c("England")
SET_SPLIT_SIZE <- as.integer(nrow(word_sence) / 5000)


# 並列処理しなくても充分だったのでしていない
fetch_cs <- fetchCosineSimilarity(
  seed_word_vector = word_sence[is.element(rownames(word_sence), SET_SEED_WORD), , drop = FALSE],
  target_words_sence = word_sence[!is.element(rownames(word_sence), SET_SEED_WORD), , drop = FALSE],
  seed_word_name = SET_SEED_WORD,
  split_size = SET_SPLIT_SIZE
)

# 上位10語を表示
sort(fetch_cs, decreasing = TRUE)[seq(from = 1, to = TASK_EXTRACT_NUM)]

      Cheshire       Scotland       Wildside          Italy            NWA 
     0.6423472      0.5718693      0.5519467      0.4917252      0.4814824 
United_Kingdom          Japan        Germany          Wales        Ireland 
     0.4802970      0.4760204      0.4742704      0.4617970      0.4571639

89. 加法構成性によるアナロジー

85で得た単語の意味ベクトルを読み込み，vec(“Spain”) - vec(“Madrid”) + vec(“Athens”)を計算し，そのベクトルと類似度の高い10語とその類似度を出力せよ．

# 意味ベクトルを演算
# 「+」と「-」のみの演算子に対応（他の演算子はスルーするので注意）
createArithmeticWordVector <- function (
  word_sence, def_arithmetic
) {
  return(
    colSums(
      do.call(
        what = "rbind", 
        args = lapply(
          X = names(def_arithmetic),
          FUN = function (each_arithmetic) {
            return(
              switch(EXPR = as.character(def_arithmetic[each_arithmetic]),
                "+" = + word_sence[each_arithmetic, ],
                "-" = - word_sence[each_arithmetic, ]
              )
            )
          }
        )
      )
    )
  )
}

TASK_EXTRACT_NUM <- 10
SET_DEF_ARITHMETIC <- c("Spain" = "+", "Madrid" = "-", "Athens" = "+")
SET_SPLIT_SIZE <- as.integer(nrow(word_sence) / 5000)


# 演算対象から仮の名前を生成して、それで計算結果を絞り込み
create_arithmtic_word_name <- stringr::str_c(names(SET_DEF_ARITHMETIC), collapse = "_")
fetch_arithmetic_cs <- fetchCosineSimilarity(
  seed_word_vector = matrix(
    data = createArithmeticWordVector(
      word_sence = word_sence, def_arithmetic = SET_DEF_ARITHMETIC
    ),
    nrow = 1, ncol = ncol(word_sence),
    dimnames = list(create_arithmtic_word_name, NULL)
  ),
  target_words_sence = word_sence,
  seed_word_name = create_arithmtic_word_name,
  split_size = SET_SPLIT_SIZE
)

# 上位10語を表示
sort(fetch_arithmetic_cs, decreasing = TRUE)[seq(from = 1, to = TASK_EXTRACT_NUM)]

        Spain       Austria   Netherlands       Belgium      Brussels 
    0.9176357     0.9098281     0.9011710     0.8996146     0.8956092 
      Antwerp Télévisions        Turkey        France       Hungary 
    0.8954350     0.8952880     0.8932235     0.8909760     0.8909016

所感

言語処理100本ノック(2015年版)のベクトル空間法(I)の章をやってみました（データ量を減らして逃げてしまいました）。
今回は疎行列（スパース行列）化する際に{Matrix}のdgcMatrixクラスへ変換していますが、{SparseM}や {slam}などのパッケージでも疎行列は扱えます（一般的には{Matrix}が使われていて、多くのパッケージでもサポートされていてよいのではという意見や、{Matrix}と{slam}を比較した記事もあります）。
　SparseM: Sparse Linear Algebra
　slam: Sparse Lightweight Arrays and Matrices
– （なお、Rでテキストマイニングをする{tm}では、単語文書行列を作るときに{slam}を使えるようです）
　Term-Document Matrix {tm}
{Matrix}の疎行列のクラスには他にもdgRMatrixやdgTMatrixなどがありますが、どういうデータにはどういうクラスが向いていて、どういう風に変換しているかは今後調査します（上級ハンドブックに記載あり）。

showClass("sparseMatrix")

Virtual Class "sparseMatrix" [package "Matrix"]

Slots:
                        
Name:       Dim Dimnames
Class:  integer     list

Extends: 
Class "Matrix", directly
Class "mMatrix", by class "Matrix", distance 2

Known Subclasses: 
Class "diagonalMatrix", directly
Class "TsparseMatrix", directly
Class "CsparseMatrix", directly
Class "RsparseMatrix", directly
Class "dsparseMatrix", directly
Class "lsparseMatrix", directly
Class "nsparseMatrix", directly
Class "indMatrix", directly
Class "ddiMatrix", by class "diagonalMatrix", distance 2
Class "ldiMatrix", by class "diagonalMatrix", distance 2
Class "dgTMatrix", by class "TsparseMatrix", distance 2
Class "dtTMatrix", by class "TsparseMatrix", distance 2
Class "dsTMatrix", by class "TsparseMatrix", distance 2
Class "lgTMatrix", by class "TsparseMatrix", distance 2
Class "ltTMatrix", by class "TsparseMatrix", distance 2
Class "lsTMatrix", by class "TsparseMatrix", distance 2
Class "ngTMatrix", by class "TsparseMatrix", distance 2
Class "ntTMatrix", by class "TsparseMatrix", distance 2
Class "nsTMatrix", by class "TsparseMatrix", distance 2
Class "dgCMatrix", by class "CsparseMatrix", distance 2
Class "dtCMatrix", by class "CsparseMatrix", distance 2
Class "dsCMatrix", by class "CsparseMatrix", distance 2
Class "lgCMatrix", by class "CsparseMatrix", distance 2
Class "ltCMatrix", by class "CsparseMatrix", distance 2
Class "lsCMatrix", by class "CsparseMatrix", distance 2
Class "ngCMatrix", by class "CsparseMatrix", distance 2
Class "ntCMatrix", by class "CsparseMatrix", distance 2
Class "nsCMatrix", by class "CsparseMatrix", distance 2
Class "dgRMatrix", by class "RsparseMatrix", distance 2
Class "dtRMatrix", by class "RsparseMatrix", distance 2
Class "dsRMatrix", by class "RsparseMatrix", distance 2
Class "lgRMatrix", by class "RsparseMatrix", distance 2
Class "ltRMatrix", by class "RsparseMatrix", distance 2
Class "lsRMatrix", by class "RsparseMatrix", distance 2
Class "ngRMatrix", by class "RsparseMatrix", distance 2
Class "ntRMatrix", by class "RsparseMatrix", distance 2
Class "nsRMatrix", by class "RsparseMatrix", distance 2
Class "dgTMatrix", by class "dsparseMatrix", distance 2
Class "dtTMatrix", by class "dsparseMatrix", distance 2
Class "dsTMatrix", by class "dsparseMatrix", distance 2
Class "dgCMatrix", by class "dsparseMatrix", distance 2
Class "dtCMatrix", by class "dsparseMatrix", distance 2
Class "dsCMatrix", by class "dsparseMatrix", distance 2
Class "dgRMatrix", by class "dsparseMatrix", distance 2
Class "dtRMatrix", by class "dsparseMatrix", distance 2
Class "dsRMatrix", by class "dsparseMatrix", distance 2
Class "lgTMatrix", by class "lsparseMatrix", distance 2
Class "ltTMatrix", by class "lsparseMatrix", distance 2
Class "lsTMatrix", by class "lsparseMatrix", distance 2
Class "lgCMatrix", by class "lsparseMatrix", distance 2
Class "ltCMatrix", by class "lsparseMatrix", distance 2
Class "lsCMatrix", by class "lsparseMatrix", distance 2
Class "lgRMatrix", by class "lsparseMatrix", distance 2
Class "ltRMatrix", by class "lsparseMatrix", distance 2
Class "lsRMatrix", by class "lsparseMatrix", distance 2
Class "ngTMatrix", by class "nsparseMatrix", distance 2
Class "ntTMatrix", by class "nsparseMatrix", distance 2
Class "nsTMatrix", by class "nsparseMatrix", distance 2
Class "ngCMatrix", by class "nsparseMatrix", distance 2
Class "ntCMatrix", by class "nsparseMatrix", distance 2
Class "nsCMatrix", by class "nsparseMatrix", distance 2
Class "ngRMatrix", by class "nsparseMatrix", distance 2
Class "ntRMatrix", by class "nsparseMatrix", distance 2
Class "nsRMatrix", by class "nsparseMatrix", distance 2
Class "pMatrix", by class "indMatrix", distance 2

言語処理で扱うデータは大規模なスパースになる傾向があり、統計モデリングや機械学習するには逐次処理させるか、データを疎行列化（モデル式で疎行列化できるMatrix::sparse.model.matrixが便利かも）して、疎行列に対応した関数を適用する必要があります。{glmnet}や{xgboost}, {e1071}などに疎行列に対応した関数が定義されています（他にも{MatrixModels}や{sparsereg}も対応している模様（こちらは調査中））。
　Sparse Model Matrices for Generalized Linear Models
　The Sparse Matrix and {glmnet}
　MatrixModels: Modelling with Sparse And Dense Matrices
　sparsereg: Sparse Bayesian Models for Regression, Subgroup Analysis, and Panel Data
– また、前回の課題で使った{FeatureHashing}のhashed.model.matrix関数はdgcMatrix引数をTRUEにすると、疎行列として結果を返します（前回の課題72を参照のこと）。
主成分分析の課題では特異値分解した結果を使いました。この際、implicitly restarted Lanczos bidiagonalizationを実装した{irlba}を用いましたが、他にも乱拓法による近似もあります（よく理解していないので、下記に参照したリンクを記述）。
　Augmented Implicitly Restarted Lanczos Bidiagonalization Methods
　Partial Lanczos SVD methods for R
– 乱拓法
　Fast Randomized SVD
　SVD for sparse matrix in R
　RedSVD
　RcppEigen and SVD
– その他メモ
　大規模データ処理のための行列の低ランク近似 – SVD から用例ベースの行列分解まで –
　18 Matrix decomposition and latent semantic indexing (pp.369-384)
　主成分分析と特異値分解
　潜在的意味インデキシング（LSI）徹底入門
　今年のSIGKDDベストペーパーを実装・公開してみました

# RedSVDをHomebrewからインストールして、RからRedSVDを使おうとした際のメモ（not run）
# https://github.com/ntessore/homebrew-nt
brew tap ntessore/homebrew-nt
brew install redsvd

# テストで失敗する
# devtools::install_github("xiangze/RRedsvd")

本課題をこなすにあたり、下記の書籍で学んでおくといいと思います（自分復習用）。
　Rによるハイパフォーマンスコンピューティング
　プログラミングのための線形代数
　これなら分かる応用数学教室―最小二乗法からウェーブレットまで
個人的には、言語処理の課題よりも大規模データの処理を考える方が難しかったです。

実行環境

library(devtools)
devtools::session_info()

Session info --------------------------------------------------------------

 setting  value                       
 version  R version 3.2.1 (2015-06-18)
 system   x86_64, darwin13.4.0        
 ui       X11                         
 language (EN)                        
 collate  ja_JP.UTF-8                 
 tz       Asia/Tokyo

Packages ------------------------------------------------------------------

 package     * version     date      
 assertthat  * 0.1         2013-12-06
 codetools     0.2-11      2015-03-10
 colorspace    1.2-6       2015-03-11
 countrycode * 0.18        2014-12-29
 crayon        1.3.0       2015-06-05
 curl          0.9         2015-06-19
 DBI           0.3.1       2014-09-24
 devtools    * 1.8.0       2015-05-09
 digest        0.6.8       2014-12-31
 doParallel    1.0.8       2014-02-28
 doRNG         1.6         2014-03-07
 dplyr       * 0.4.2.9002  2015-07-25
 evaluate      0.7         2015-04-21
 foreach       1.4.2       2014-04-11
 formatR       1.2         2015-04-21
 ggplot2     * 1.0.1       2015-03-17
 git2r         0.10.1      2015-05-07
 gtable        0.1.2       2012-12-05
 hadleyverse * 0.1         2015-08-09
 haven       * 0.2.0       2015-04-09
 htmltools     0.2.6       2014-09-08
 inline      * 0.3.14      2015-04-13
 irlba       * 1.0.3       2014-01-25
 iterators     1.0.7       2014-04-11
 knitr       * 1.10.5      2015-05-06
 lattice       0.20-31     2015-03-30
 lazyeval    * 0.1.10.9000 2015-07-25
 lubridate   * 1.3.3       2013-12-31
 magrittr      1.5         2014-11-22
 MASS          7.3-41      2015-06-18
 Matrix      * 1.2-1       2015-06-01
 memoise       0.2.1       2014-04-22
 munsell       0.4.2       2013-07-11
 pforeach    * 1.3         2015-07-25
 pkgmaker      0.22        2014-05-14
 plyr        * 1.8.3       2015-06-12
 proto         0.3-10      2012-12-22
 R6            2.0.1       2014-10-29
 Rcpp        * 0.12.0      2015-07-26
 readr       * 0.1.1.9000  2015-07-25
 readxl      * 0.1.0       2015-04-14
 registry      0.2         2012-01-24
 reshape2      1.4.1       2014-12-06
 rmarkdown     0.7         2015-06-13
 rngtools      1.2.4       2014-03-06
 rversions     1.0.1       2015-06-06
 scales        0.2.5       2015-06-12
 stringi     * 0.5-5       2015-06-29
 stringr     * 1.0.0.9000  2015-07-25
 testthat    * 0.10.0      2015-05-22
 tidyr       * 0.2.0.9000  2015-07-25
 xml2        * 0.1.1       2015-06-02
 xtable        1.7-4       2014-09-12
 yaml          2.1.13      2014-06-12
 source                               
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.1)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hadley/dplyr@75e8303)        
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (aaboyles/hadleyverse@16532fe)
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hadley/lazyeval@ecb8dc0)     
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hoxo-m/pforeach@2c44f3b)     
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (RcppCore/Rcpp@6ae91cc)       
 Github (hadley/readr@f4a3956)        
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hadley/stringr@380c88f)      
 CRAN (R 3.2.0)                       
 Github (hadley/tidyr@0dc87b2)        
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)

言語処理100本ノック第9章:ベクトル空間法 (I)

@yamano357

2015年9月4日

概要

Rコード

パッケージ読み込み

事前準備

80. コーパスの整形

81. 複合語からなる国名への対処

82. 文脈の抽出

83. 単語／文脈の頻度の計測

84. 単語文脈行列の作成

85. 主成分分析による次元圧縮

86. 単語ベクトルの表示

87. 単語の類似度

88. 類似度の高い単語10件

89. 加法構成性によるアナロジー

所感

実行環境

言語処理100本ノック 第9章:ベクトル空間法 (I)

@yamano357

2015年9月4日

概要

Rコード

パッケージ読み込み

事前準備

80. コーパスの整形

81. 複合語からなる国名への対処

82. 文脈の抽出

83. 単語／文脈の頻度の計測

84. 単語文脈行列の作成

85. 主成分分析による次元圧縮

86. 単語ベクトルの表示

87. 単語の類似度

88. 類似度の高い単語10件

89. 加法構成性によるアナロジー

所感

実行環境

言語処理100本ノック第9章:ベクトル空間法 (I)