Previous rPubs pages
- Chapter 1: Warm-up
- Chapter 2: UNIX command basics
- Chapter 3: Regular expressions
- Chapter 4: Morphological analysis
- Chapter 5: Parsing
- Chapter 6: Processing English text
- Chapter 7: Databases
- Chapter 8: Machine learning
- Chapter 9: Vector space methods (I)

Continuing from the previous posts, this one works through the 言語処理100本ノック (NLP 100 Exercises, 2015 edition).

(The preface below about the exercises is the same as last time.)


Overview

Preface (about the NLP 100 Exercises)
- In this post I work through the 言語処理100本ノック (NLP 100 Exercises, 2015 edition) published by the Inui-Okazaki Laboratory at Tohoku University, using R.
- The pre-revision version of the exercises is also available on the laboratory's site.


Preface (about R)
- Please note that I do not explain R syntax or functions at all.
- For string processing I use {stringr} (>= 1.0.0) and pipes wherever possible, rather than the {base} functions ({stringi} is also used where it fits the task). Note that some exercises do not lend themselves to a pipe-based style.
- For this chapter, the distributed word representations (word2vec embeddings) were trained in advance from the command line and are loaded here.


Reference pages


Comments and corrections
- Suggestions ("this would be better", "there is a handy function for that"), bug reports, and corrections are all welcome.
- Reports through either of the channels below are much appreciated (note: I am not particularly used to Git).
 Twitter, GitHub



R code

Load packages

# devtools::install_github("aaboyles/hadleyverse")
SET_LOAD_LIB <- c("knitr", "hadleyverse", "stringi", "lazyeval", "Matrix", "countrycode", "networkD3", "ggmap", "leaflet", "rgdal", "tsne")
sapply(X = SET_LOAD_LIB, FUN = library, character.only = TRUE, logical.return = TRUE)
##       knitr hadleyverse     stringi    lazyeval      Matrix countrycode 
##        TRUE        TRUE        TRUE        TRUE        TRUE        TRUE 
##   networkD3       ggmap     leaflet       rgdal        tsne 
##        TRUE        TRUE        TRUE        TRUE        TRUE
load_packages <- as.character(na.omit(object = stringr::str_match(string = search(), pattern = "^package:(.*)")[, 2]))
knitr::opts_chunk$set(comment = NA)

Preparation

SET_TASK_FILE_NAMES <- list(
  # Corpus created in task 81
  PREV_CORPUS = "enwiki-courpus.txt",
  # Model file produced by running word2vec from the command line
  WORD2VEC_MODEL = "enwiki-word2vec.model",
  # Word vector representations created in task 85
  WORDVEC = "word-vector.RData",
  # The file we need from inside the WordSimilarity-353 Test Collection zip archive
  SIMILARITY_EVAL = "combined.csv"
)

TASK_INPUT_URL <- list(
  # Evaluation data for word analogies
  ANALOGY_EVAL = "https://word2vec.googlecode.com/svn/trunk/questions-words.txt",
  # Evaluation data of The WordSimilarity-353 Test Collection
  SIMILARITY_EVAL = "http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.zip"
)


# Download the files
sapply(
  X = seq(from = 1, to = length(TASK_INPUT_URL)),
  FUN = function (i) {
    download.file(
      url = TASK_INPUT_URL[[i]], destfile = basename(path = TASK_INPUT_URL[[i]]), 
      method = "wget", quiet = FALSE
    )
    return(
      file.exists(basename(path = TASK_INPUT_URL[[i]]))
    )
  }
)
[1] TRUE TRUE
# Unzip the multi-file archive and extract only the file we need
utils::unzip(
  zipfile = basename(path = TASK_INPUT_URL$SIMILARITY_EVAL), 
  files = SET_TASK_FILE_NAMES$SIMILARITY_EVAL
)
# Re-define the helper functions from Chapter 9

# Extract only the rows we need from the cosine similarity matrix (avoids materializing the full N x N output)
filterCosineSim <- function (
  seed_word_vector, target_word_vectors, 
  extract_rownames = NULL
) {
  word_vectors <- rbind(seed_word_vector, target_word_vectors)
  numerator <- crossprod(x = t(x = word_vectors))
  denominator <- diag(numerator)
  return((numerator / sqrt(outer(denominator, denominator)))[extract_rownames, ])
}
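
As a quick sanity check of filterCosineSim(), here is a minimal, made-up example (the toy matrices below are not part of the task data): "b" points in the same direction as the seed "a", so its cosine similarity should be exactly 1.

toy_seed <- matrix(data = c(1, 2, 3), nrow = 1, dimnames = list("a", NULL))
toy_targets <- matrix(
  data = c(2, 4, 6, -1, 0, 1), nrow = 2, byrow = TRUE,
  dimnames = list(c("b", "c"), NULL)
)
# Expect approximately: a = 1.000, b = 1.000, c = 0.378
filterCosineSim(
  seed_word_vector = toy_seed, target_word_vectors = toy_targets,
  extract_rownames = "a"
)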

# Assign random group IDs so that fewer vectors go into each cosine similarity computation
fetchCosineSimilarity <- function(
  seed_word_vector, target_words_sence, 
  seed_word_name,
  split_size
){
  
  # Convert the sparse matrix to a dense one
  seed_word_vector <- t(apply(X = seed_word_vector, MARGIN = 1, FUN = as.matrix))
  target_words_sence <- data.frame(
    t(apply(X = target_words_sence, MARGIN = 1, FUN = as.matrix)), 
    stringsAsFactors = FALSE
  )
  
  target_words_sence$split <- sample.int(
    n = split_size, size = nrow(target_words_sence),
    replace = TRUE
  )
  fetch_cs <- lapply(
    X = split(
      x = target_words_sence[, !is.element(colnames(target_words_sence), "split")],
      f = target_words_sence$split
    ),
    FUN = function (target_sence) {
      cosine_sim_res <- filterCosineSim(
        seed_word_vector = seed_word_vector,
        target_word_vectors = as.matrix(target_sence), 
        extract_rownames = seed_word_name
      )[-1]  # drop the seed word's similarity to itself
      return(cosine_sim_res[!is.nan(x = cosine_sim_res)])
    }
  )
  cs_names <- dplyr::combine(sapply(X = fetch_cs, FUN = names))
  fetch_cs <- dplyr::combine(fetch_cs)
  names(fetch_cs) <- cs_names

  return(fetch_cs)
}
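
The split_size trick above only manages memory: each chunk yields an (m + 1) x (m + 1) similarity matrix instead of one huge (N + 1) x (N + 1) one. A tiny illustration of the random grouping (hypothetical sizes, not task data):

x <- data.frame(v = seq_len(10))
x$split <- sample.int(n = 3, size = nrow(x), replace = TRUE)
# Roughly equal-sized random groups; each group is processed separately above
lengths(split(x = x$v, f = x$split))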

# Arithmetic on the meaning vectors
# Supports only the "+" and "-" operators (anything else is silently ignored)
createArithmeticWordVector <- function (
  word_sence, def_arithmetic
) {
  return(
    colSums(
      do.call(
        what = "rbind", 
        args = lapply(
          X = names(def_arithmetic),
          FUN = function (each_arithmetic) {
            return(
              switch(EXPR = as.character(def_arithmetic[each_arithmetic]),
                "+" = + word_sence[each_arithmetic, ],
                "-" = - word_sence[each_arithmetic, ]
              )
            )
          }
        )
      )
    )
  )
}
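
A toy check of the vector arithmetic (hypothetical 2-dimensional "embeddings", not task data):

toy_sence <- matrix(
  data = c(1, 0, 0, 1, 1, 1), nrow = 3, byrow = TRUE,
  dimnames = list(c("Spain", "Madrid", "Athens"), NULL)
)
# Expect (1,0) - (0,1) + (1,1) = (2, 0)
createArithmeticWordVector(
  word_sence = toy_sence,
  def_arithmetic = c("Spain" = "+", "Madrid" = "-", "Athens" = "+")
)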
# Build the model file in advance
# https://github.com/Homebrew/homebrew-head-only/
# brew install --HEAD https://raw.githubusercontent.com/Homebrew/homebrew-head-only/master/word2vec.rb

SET_WORD2_VEC_PARAM <- list(
  SIZE = 200, WINDOW = 5, SAMPLE = 0.0001,
  NEGATIVE = 5,
  CBOW = 1,
  ITER = 3
)

word2vec_training_command <- stringr::str_c(
  stringr::str_c(
    stringr::str_c("-", stringr::str_to_lower(string = names(SET_WORD2_VEC_PARAM))),
    SET_WORD2_VEC_PARAM,
    sep = " "
  )
)
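
# The vector built above should look like this (str_c coerces 0.0001 to "1e-04"):
# "-size 200" "-window 5" "-sample 1e-04" "-negative 5" "-cbow 1" "-iter 3"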

# Run the command below
# word2vec -train enwiki-courpus.txt -output enwiki-word2vec.model -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 1 -iter 3
# system2(
#   command = "word2vec",
#   args = stringr::str_c(
#     "-train", SET_TASK_FILE_NAMES$PREV_CORPUS, "-output", SET_TASK_FILE_NAMES$WORD2VEC_MODEL,
#     stringi::stri_flatten(str = word2vec_training_command, collapse = " "),
#    "-hs 0 -binary 0",
#    sep = " "
#  )
#)

90. Training with word2vec

Apply word2vec to the corpus created in task 81 and learn word vectors. Then convert the format of the learned word vectors and run the programs from tasks 86-89.

# Load the word2vec results
word_vec <- stringr::str_split_fixed(
  string = readr::read_lines(file = SET_TASK_FILE_NAMES$WORD2VEC_MODEL, n_max = -1),
  pattern = "[:blank:]", n = SET_WORD2_VEC_PARAM$SIZE + 1
)[-1, ]
wordvec_matrix <- matrix(
  data = as.numeric(word_vec[, -1, drop = FALSE]),
  ncol = SET_WORD2_VEC_PARAM$SIZE,
  dimnames = list(word_vec[, 1], NULL)
)
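
The first line of a word2vec text-format model is a "<vocabulary size> <dimensions>" header, which is why the code above drops row 1. A quick shape check (the vocabulary size depends on the corpus):

# Should be (vocabulary size) x SET_WORD2_VEC_PARAM$SIZE
dim(wordvec_matrix)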


# 86
# Read the word meaning vectors obtained in task 85 and display the vector for "United States". Note that "United States" is represented internally as "United_States".

SET_SEARCH_WORD <- "United_States"
wordvec_matrix[is.element(rownames(wordvec_matrix), SET_SEARCH_WORD), ]
  [1]  0.368590  0.127522 -0.021429 -1.400923  0.051056 -0.594420  0.708153
  [8] -0.536785  0.257927  1.473590  0.203767  0.256454  1.032107 -0.283982
 [15] -0.102399  0.023703  1.668537  0.428334  1.282311 -0.215650  0.202555
 [22]  1.015952  0.173203  0.739720  0.616364  0.060005  1.055462 -0.532724
 [29]  0.086540 -0.890822  0.714439 -0.533832 -0.732312 -0.452217 -0.344213
 [36]  0.899291  0.730087  0.694130  0.691109 -0.395891  0.099172  0.657461
 [43] -0.387319  0.882841  0.205190  0.424588 -0.331335  0.214957  0.079813
 [50] -0.372798  0.261392  1.334000 -0.963945  0.128226  0.496294 -2.025836
 [57] -0.471818 -0.646273 -0.004088  0.298654  0.191489 -0.679419  0.495008
 [64]  0.243389 -0.533670 -0.371724 -0.555603  0.530962 -0.562673 -0.469252
 [71] -0.787474  0.414495  1.211043 -0.996906 -0.078796 -0.017270 -0.815958
 [78]  1.651180 -0.576896 -0.045888  0.075785  0.162598 -0.196871  0.840418
 [85]  0.216479 -0.689878  0.636621 -0.951645  0.264623  0.065514 -0.739251
 [92]  0.635188  0.226935 -0.258729 -0.739013  0.503876  0.273261  0.050722
 [99]  1.103913 -0.743350 -0.112372  0.492332 -1.240996 -0.384699 -1.133474
[106]  1.056272 -0.062803 -0.911683 -0.625171  0.807135 -0.996981  0.406196
[113]  0.712704  0.898434 -0.112743 -0.562427 -0.040505  0.950574  0.553425
[120] -0.104227 -0.164486 -0.033910  0.017812 -0.199443 -0.552868 -0.004037
[127] -0.204095 -0.779302 -1.203280 -0.687044  1.454973 -0.807441 -0.767562
[134] -0.558480  0.525875 -0.020303  1.269005  0.492838 -1.056686 -0.933681
[141]  0.146587 -0.466178 -0.340006  0.695552  0.215535  0.352032 -0.673910
[148] -0.902938 -0.463121 -0.782760  0.789848 -1.007588  1.192383 -0.909179
[155]  0.199025  0.153629 -0.212183  0.302305 -0.532938  1.513391  1.733162
[162]  0.352315  1.465211 -0.640137  0.054654  0.351909 -0.255554  0.789699
[169] -0.343751  0.309863 -0.005365 -0.352621  0.865931  0.248544 -1.241198
[176]  0.434984  0.802694  0.149254 -0.416365  0.961740  0.139642  0.617373
[183] -0.314953 -0.068143  0.023058  0.838402  0.247863  0.574815 -0.142285
[190] -0.328798  0.152334  0.066419  0.153841  0.696354  0.481986 -0.251259
[197] -0.853765 -0.713053  1.306750  0.639339
# 87
# Read the word meaning vectors obtained in task 85 and compute the cosine similarity between "United States" and "U.S.". Note that "U.S." is represented internally as "U.S".

SET_COMPARE_WORDS <- c("United_States", "^U\\.S$")
filterCosineSim(
  seed_word_vector = wordvec_matrix[
    is.element(rownames(wordvec_matrix), SET_COMPARE_WORDS[1]), , drop = FALSE
  ],
  target_word_vectors = wordvec_matrix[
    stringr::str_detect(
      string = rownames(wordvec_matrix), pattern = SET_COMPARE_WORDS[-1]
    ), , drop = FALSE
  ], 
  extract_rownames = SET_COMPARE_WORDS[1]
)
United_States           U.S 
     1.000000      0.782201 
# 88
# Read the word meaning vectors obtained in task 85 and output the 10 words with the highest cosine similarity to "England", along with their similarities.

TASK_EXTRACT_NUM <- 10
SET_SEED_WORD <- c("England")
SET_SPLIT_SIZE <- as.integer(nrow(wordvec_matrix) / 5000)
fetch_cs <- fetchCosineSimilarity(
  seed_word_vector = wordvec_matrix[is.element(rownames(wordvec_matrix), SET_SEED_WORD), , drop = FALSE],
  target_words_sence = wordvec_matrix[!is.element(rownames(wordvec_matrix), SET_SEED_WORD), , drop = FALSE],
  seed_word_name = SET_SEED_WORD,
  split_size = SET_SPLIT_SIZE
)

# Show the top 10 words
sort(fetch_cs, decreasing = TRUE)[seq(from = 1, to = TASK_EXTRACT_NUM)]
  Scotland      Wales     London    Britain    Ireland     Sydney 
 0.7992750  0.7567336  0.6397057  0.6160645  0.6102886  0.6019140 
 Liverpool Lancashire   Somerset  England's 
 0.5959739  0.5739732  0.5693245  0.5673810 
# 89
# Read the word meaning vectors obtained in task 85, compute vec("Spain") - vec("Madrid") + vec("Athens"), and output the 10 words most similar to that vector, along with their similarities.

TASK_EXTRACT_NUM <- 10
SET_DEF_ARITHMETIC <- c("Spain" = "+", "Madrid" = "-", "Athens" = "+")
SET_SPLIT_SIZE <- as.integer(nrow(wordvec_matrix) / 5000)
create_arithmtic_word_name <- stringr::str_c(names(SET_DEF_ARITHMETIC), collapse = "_")
fetch_arithmetic_cs <- fetchCosineSimilarity(
  seed_word_vector = matrix(
    data = createArithmeticWordVector(
      word_sence = wordvec_matrix, def_arithmetic = SET_DEF_ARITHMETIC
    ),
    nrow = 1, ncol = ncol(wordvec_matrix),
    dimnames = list(create_arithmtic_word_name, NULL)
  ),
  target_words_sence = wordvec_matrix,
  seed_word_name = create_arithmtic_word_name,
  split_size = SET_SPLIT_SIZE
)

# Show the top 10 words
sort(fetch_arithmetic_cs, decreasing = TRUE)[seq(from = 1, to = TASK_EXTRACT_NUM)]
    Spain    Greece     Italy    Russia     Egypt   Austria    Athens 
0.8230848 0.8111017 0.8057365 0.7773127 0.7739961 0.7633028 0.7555862 
    Syria   Hungary    Poland 
0.7545515 0.7379380 0.7376987 

91. Preparing the analogy data

Download the evaluation data for word analogies. In this data, a line beginning with ":" marks a section name; for example, the line ": capital-common-countries" marks the start of the section "capital-common-countries". From the downloaded evaluation data, extract the evaluation examples contained in the "family" section and save them to a file.

SET_EXTRACT_PATTERN <- list(
  SECTION_START = "^:", 
  TARGET_SECTION = "family"
)


# All lines, with their section IDs
read_analogy <- dplyr::data_frame(
  text = readr::read_lines(file = TASK_INPUT_URL$ANALOGY_EVAL, n_max = -1)
) %>%
  dplyr::mutate(
    section_id = cumsum(
      x = stringr::str_detect(string = .$text, pattern = SET_EXTRACT_PATTERN$SECTION_START)
    )
  )
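
The cumsum() over the section-header detector is what assigns the section IDs: every line receives the index of the most recent line that started with ":". A minimal illustration with made-up lines:

# Expect: 1 1 1 2 2
cumsum(stringr::str_detect(string = c(": a", "x", "y", ": b", "z"), pattern = "^:"))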

# Keep only the section we need
analogy_eval_word <- read_analogy %>%
  dplyr::filter(
    is.element(
      el = .$section_id,
      set = read_analogy %>%
        dplyr::filter(
          stringr::str_detect(string = .$text, pattern = SET_EXTRACT_PATTERN$TARGET_SECTION)
        ) %>%
        .$section_id
    )
  ) %>%
  .$text
# Drop the ": family" section header line itself
analogy_eval_word <- analogy_eval_word[-1]

# Show a reasonable number of examples
head(x = analogy_eval_word, n = 20)
 [1] "boy girl brother sister"          "boy girl brothers sisters"       
 [3] "boy girl dad mom"                 "boy girl father mother"          
 [5] "boy girl grandfather grandmother" "boy girl grandpa grandma"        
 [7] "boy girl grandson granddaughter"  "boy girl groom bride"            
 [9] "boy girl he she"                  "boy girl his her"                
[11] "boy girl husband wife"            "boy girl king queen"             
[13] "boy girl man woman"               "boy girl nephew niece"           
[15] "boy girl policeman policewoman"   "boy girl prince princess"        
[17] "boy girl son daughter"            "boy girl sons daughters"         
[19] "boy girl stepbrother stepsister"  "boy girl stepfather stepmother"  

92. Applying the analogy data

For each example in the evaluation data created in task 91, compute vec(word in column 2) - vec(word in column 1) + vec(word in column 3), and find the word most similar to that vector together with its similarity. Append the word and similarity to the end of each example. Apply this program to the word vectors created in task 85 and to those created in task 90.

createArithmticWordName <- function(
  target_word,
  set_arithmetic = c("1" = "-", "2" = "+", "3" = "+")
) {
  word_arithmetic <- as.character(set_arithmetic)
  names(word_arithmetic) <- target_word[as.integer(names(set_arithmetic))]
  return(list(word_arithmetic))
}
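
For one evaluation row, this helper turns the first three words into the named operator vector that createArithmeticWordVector() expects:

# Expect: list(c(boy = "-", girl = "+", brother = "+"))
createArithmticWordName(target_word = c("boy", "girl", "brother", "sister"))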

# Compute cosine similarities for several seed vectors at once
# (a modified version of fetchCosineSimilarity)
fetchMulutiCosineSimilarity <- function(
  seed_word_vector, target_words_sence, 
  seed_word_name,
  split_size
){
  
  # Convert the sparse matrix to a dense one
  seed_word_vector <- t(apply(X = seed_word_vector, MARGIN = 1, FUN = as.matrix))
  target_words_sence <- data.frame(
    t(apply(X = target_words_sence, MARGIN = 1, FUN = as.matrix)), 
    stringsAsFactors = FALSE
  )
  
  target_words_sence$split <- sample.int(
    n = split_size, size = nrow(target_words_sence),
    replace = TRUE
  )
  fetch_cs <- lapply(
    X = split(
      x = target_words_sence[, !is.element(colnames(target_words_sence), "split")],
      f = target_words_sence$split
    ),
    FUN = function (target_sence) {
      # The part below differs from fetchCosineSimilarity
      cosine_sim_res <- filterCosineSim(
        seed_word_vector = seed_word_vector,
        target_word_vectors = as.matrix(target_sence), 
        extract_rownames = seed_word_name
      )
      cosine_sim_res <- cosine_sim_res[, !is.element(
        el = colnames(cosine_sim_res), set = rownames(seed_word_vector))
      ]
      return(
        na.omit(
          object = replace(x = cosine_sim_res, list = is.nan(cosine_sim_res), values = 0)
        )
      )
    }
  )
  fetch_cs <- do.call(what = "cbind", args = fetch_cs)
  return(fetch_cs)
}

applyAnalogy <- function (
  wordvec_analogy, wordvec_matrix,
  apply_arithmetic_param = list(
    arithmetic_pattern = c("1" = "-", "2" = "+", "3" = "+"),
    extract_size = 10
  ),
  is_sort = TRUE
) {

  create_wordvec_arithmtic_word_name <- apply(
    X = wordvec_analogy[, 1:3], 
    MARGIN = 1,
    FUN = stringr::str_c, collapse = "_"
  )
  
  def_wordvec_arithmetic_lst <- do.call(
    what = "rbind",
    args = apply(
      X = wordvec_analogy,
      MARGIN = 1,
      FUN = createArithmticWordName, 
      set_arithmetic = apply_arithmetic_param$arithmetic_pattern
    )
  )[, ]
  
  
  fetch_arithmetic_cs <- fetchMulutiCosineSimilarity(
    seed_word_vector = matrix(
      data = sapply(
        X = def_wordvec_arithmetic_lst,
        FUN = createArithmeticWordVector,
        word_sence = wordvec_matrix
      ),
      nrow = length(def_wordvec_arithmetic_lst), ncol = ncol(wordvec_matrix),
      byrow = TRUE,
      dimnames = list(create_wordvec_arithmtic_word_name, NULL)
    ),
    target_words_sence = wordvec_matrix,
    seed_word_name = create_wordvec_arithmtic_word_name,
    split_size = as.integer(nrow(wordvec_matrix) / 5000)
  )
  
  return (
    do.call(
      what = "rbind", 
      args = lapply(
        X = seq(from = 1, to = length(def_wordvec_arithmetic_lst)),
        FUN = function (i) {
          fetched_arithmetic_vec <- fetch_arithmetic_cs[i, ]
          fetched_arithmetic_vec <- fetched_arithmetic_vec[
            setdiff(x = names(fetched_arithmetic_vec), names(def_wordvec_arithmetic_lst[[i]]))
          ]
          if (is_sort) {
            fetched_arithmetic_vec <- sort(x = fetched_arithmetic_vec, decreasing = TRUE)
          }
          fetched_arithmetic_vec <- fetched_arithmetic_vec[
            seq(from = 1, to = apply_arithmetic_param$extract_size)
          ]
          
          return(
            list(
              word = names(fetched_arithmetic_vec),
              similarity = as.numeric(fetched_arithmetic_vec)
            )
          )
        }
      )
    )
  )
}

splitWordVector <- function (target_lst) {
   return(
     stringr::str_split_fixed(
       string = stringr::str_c(
         mapply(target_lst$word, target_lst$similarity, FUN = stringr::str_c, sep = ":"),
         collapse = ":"
       ),
       pattern = ":",
       n = length(target_lst$word) * 2
    )
  )
}
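
splitWordVector() flattens one word/similarity list into a single character row; for example (hypothetical values):

# Expect a 1 x 4 character matrix: "sister" "0.82" "cousin" "0.65"
splitWordVector(
  target_lst = list(word = c("sister", "cousin"), similarity = c(0.82, 0.65))
)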

SET_APPLY_ARITHMETIC <- list(
  ARITHMETIC_PATTERN = c("1" = "-", "2" = "+", "3" = "+"),
  EXTRACT_SIZE = 10,
  WRITE_TOP_SIM = 1
)


# Read the analogy data and build the list of analogy arithmetic patterns
analogy_eval_word_mat <- stringr::str_split_fixed(
  string = analogy_eval_word, pattern = "[:space:]", n = 4
)
create_arithmtic_word_name <- apply(
  X = analogy_eval_word_mat[, 1:3], 
  MARGIN = 1,
  FUN = stringr::str_c, collapse = "_"
)
def_arithmetic_lst <- do.call(
  what = "rbind",
  args = apply(
    X = analogy_eval_word_mat,
    MARGIN = 1,
    FUN = createArithmticWordName, 
    set_arithmetic = SET_APPLY_ARITHMETIC$ARITHMETIC_PATTERN
  )
)[, ]
include_analogy_word <- unique(as.character(analogy_eval_word_mat[, 1:3]))


# word2vec
# Some words in the analogy gold data are missing from the word2vec vocabulary, so exclude those examples (wordvec_analogy_eval)
setdiff(x = include_analogy_word, y = rownames(wordvec_matrix))
[1] "grandpa"     "stepbrother" "grandma"     "policewoman" "stepsister" 
wordvec_analogy_eval <- analogy_eval_word_mat[!apply(
  X = !apply(
    X = analogy_eval_word_mat, MARGIN = 1, FUN = is.element, set = rownames(wordvec_matrix)
  ),
  MARGIN = 2,
  FUN = any
), ]
# Now all words are covered
setdiff(x = unique(as.character(wordvec_analogy_eval[, 1:3])), y = rownames(wordvec_matrix))
character(0)
# Fetch the top N results of the analogy arithmetic
wordvec_arithmetic_top_n <- applyAnalogy(
  wordvec_analogy = wordvec_analogy_eval,
  wordvec_matrix = wordvec_matrix,
  apply_arithmetic_param = list(
    arithmetic_pattern = SET_APPLY_ARITHMETIC$ARITHMETIC_PATTERN,
    extract_size = SET_APPLY_ARITHMETIC$EXTRACT_SIZE
  ),
  is_sort = TRUE
)

# Keep only the columns the task requires
analogy_append_wordvec_res <- cbind(
  wordvec_analogy_eval,
  t(apply(
    X = wordvec_arithmetic_top_n,
    MARGIN = 1,
    FUN = splitWordVector
  ))[, seq(from = 1, to = SET_APPLY_ARITHMETIC$WRITE_TOP_SIM * 2)]
)
head(x = analogy_append_wordvec_res, n = 20)
      [,1]      [,2]     [,3]          [,4]            [,5]           
 [1,] "boy"     "girl"   "brother"     "sister"        "cousin"       
 [2,] "boy"     "girl"   "brothers"    "sisters"       "sisters"      
 [3,] "boy"     "girl"   "dad"         "mom"           "kiss"         
 [4,] "boy"     "girl"   "father"      "mother"        "mother"       
 [5,] "boy"     "girl"   "grandfather" "grandmother"   "grandmother"  
 [6,] "boy"     "girl"   "grandson"    "granddaughter" "granddaughter"
 [7,] "boy"     "girl"   "groom"       "bride"         "kiss"         
 [8,] "boy"     "girl"   "he"          "she"           "she"          
 [9,] "boy"     "girl"   "his"         "her"           "her"          
[10,] "boy"     "girl"   "husband"     "wife"          "wife"         
[11,] "boy"     "girl"   "king"        "queen"         "prince"       
[12,] "boy"     "girl"   "man"         "woman"         "woman"        
[13,] "boy"     "girl"   "nephew"      "niece"         "Joan"         
[14,] "boy"     "girl"   "prince"      "princess"      "consort"      
[15,] "boy"     "girl"   "son"         "daughter"      "daughter"     
[16,] "boy"     "girl"   "sons"        "daughters"     "daughters"    
[17,] "boy"     "girl"   "stepfather"  "stepmother"    "prostitute"   
[18,] "boy"     "girl"   "stepson"     "stepdaughter"  "Milian"       
[19,] "boy"     "girl"   "uncle"       "aunt"          "grandmother"  
[20,] "brother" "sister" "brothers"    "sisters"       "sisters"      
      [,6]               
 [1,] "0.828238966901142"
 [2,] "0.835280785156725"
 [3,] "0.723447611619134"
 [4,] "0.89118786494309" 
 [5,] "0.806032960904853"
 [6,] "0.78369237460507" 
 [7,] "0.654958950727345"
 [8,] "0.779895379702318"
 [9,] "0.740919686894226"
[10,] "0.755086220492689"
[11,] "0.83725477167027" 
[12,] "0.843906776965192"
[13,] "0.820468618837243"
[14,] "0.80662089494703" 
[15,] "0.874979776555592"
[16,] "0.878910715287467"
[17,] "0.627205063868474"
[18,] "0.541829975123195"
[19,] "0.84014370826849" 
[20,] "0.633077911377503"
# LSI
# Load the word vector data saved in the previous chapter
load(file = SET_TASK_FILE_NAMES$WORDVEC)

# Some words in the analogy gold data are missing from the meaning vectors, so exclude those examples (word_sence_analogy_eval)
setdiff(x = include_analogy_word, y = rownames(word_sence))
[1] "grandpa"
word_sence_analogy_eval <- analogy_eval_word_mat[!apply(
  X = !apply(
    X = analogy_eval_word_mat, MARGIN = 1, FUN = is.element, set = rownames(word_sence)
  ),
  MARGIN = 2,
  FUN = any
), ]
# Now all words are covered
setdiff(x = unique(as.character(word_sence_analogy_eval[, 1:3])), y = rownames(word_sence))
character(0)
# Fetch the top N results of the analogy arithmetic
word_sence_arithmetic_top_n <- applyAnalogy(
  wordvec_analogy = word_sence_analogy_eval,
  wordvec_matrix = word_sence,
  apply_arithmetic_param = list(
    arithmetic_pattern = SET_APPLY_ARITHMETIC$ARITHMETIC_PATTERN,
    extract_size = SET_APPLY_ARITHMETIC$EXTRACT_SIZE
  ),
  is_sort = TRUE
)

# Keep only the columns the task requires
analogy_append_word_sence_res <- cbind(
  word_sence_analogy_eval,
  t(apply(
    X = word_sence_arithmetic_top_n,
    MARGIN = 1,
    FUN = splitWordVector
  ))[, seq(from = 1, to = SET_APPLY_ARITHMETIC$WRITE_TOP_SIM * 2)]
)
head(x = analogy_append_word_sence_res, n = 20)
      [,1]  [,2]   [,3]          [,4]            [,5]         
 [1,] "boy" "girl" "brother"     "sister"        "son"        
 [2,] "boy" "girl" "brothers"    "sisters"       "sisters"    
 [3,] "boy" "girl" "dad"         "mom"           "memoir"     
 [4,] "boy" "girl" "father"      "mother"        "elder"      
 [5,] "boy" "girl" "grandfather" "grandmother"   "grandmother"
 [6,] "boy" "girl" "grandson"    "granddaughter" "nephew"     
 [7,] "boy" "girl" "groom"       "bride"         "realist"    
 [8,] "boy" "girl" "he"          "she"           "she"        
 [9,] "boy" "girl" "his"         "her"           "Evelyn"     
[10,] "boy" "girl" "husband"     "wife"          "divorced"   
[11,] "boy" "girl" "king"        "queen"         "loved"      
[12,] "boy" "girl" "man"         "woman"         "woman"      
[13,] "boy" "girl" "nephew"      "niece"         "grandson"   
[14,] "boy" "girl" "policeman"   "policewoman"   "realist"    
[15,] "boy" "girl" "prince"      "princess"      "realist"    
[16,] "boy" "girl" "son"         "daughter"      "brother"    
[17,] "boy" "girl" "sons"        "daughters"     "daughters"  
[18,] "boy" "girl" "stepbrother" "stepsister"    "realist"    
[19,] "boy" "girl" "stepfather"  "stepmother"    "companion"  
[20,] "boy" "girl" "stepson"     "stepdaughter"  "realist"    
      [,6]               
 [1,] "0.848257364457638"
 [2,] "0.844411545623806"
 [3,] "0.970002548068282"
 [4,] "0.801289765138797"
 [5,] "0.977728842590744"
 [6,] "0.913754451664903"
 [7,] "0.717806591212509"
 [8,] "0.956029370755391"
 [9,] "0.979569439080309"
[10,] "0.864599009627346"
[11,] "0.605652505953952"
[12,] "0.797773656920137"
[13,] "0.946841991832837"
[14,] "0.718301742623418"
[15,] "0.726839227182708"
[16,] "0.849325572067868"
[17,] "0.916212662302662"
[18,] "0.717806591212509"
[19,] "0.975385831590676"
[20,] "0.717806591212509"

93. Computing the analogy task accuracy

Using the data created in task 92, compute each model's accuracy on the analogy task.

# Column indices into the data
SET_ANALOGY_COL_PROF <- list(
  TRUE_COL = 4, SELECT_COL = 5
)


# word2vec
# Exact matches over all analogy items
# (the denominator is the full evaluation set, so items skipped for missing vocabulary count as wrong)
sum(
  analogy_append_wordvec_res[, SET_ANALOGY_COL_PROF$TRUE_COL] ==
  analogy_append_wordvec_res[, SET_ANALOGY_COL_PROF$SELECT_COL]
) / nrow(analogy_eval_word_mat)
[1] 0.3221344
# Number of analogy items covered by the word2vec model
nrow(analogy_append_wordvec_res)
[1] 380
# LSI
# Exact matches over all analogy items (same denominator as above)
sum(
  analogy_append_word_sence_res[, SET_ANALOGY_COL_PROF$TRUE_COL] ==
  analogy_append_word_sence_res[, SET_ANALOGY_COL_PROF$SELECT_COL]
) / nrow(analogy_eval_word_mat)
[1] 0.1462451
# Number of analogy items covered by the LSI model
nrow(analogy_append_word_sence_res)
[1] 462

94. Computing similarities on WordSimilarity-353

Taking the evaluation data of The WordSimilarity-353 Test Collection as input, write a program that computes the similarity between the words in the first and second columns and appends the similarity value to the end of each line. Apply this program to the word vectors created in task 85 and to those created in task 90.

# Extract a word pair's similarity from the similarity matrix
# word_1, word_2, similarity
extractSimi <- function (
  word_1, word_2,
  sim_mat
){

  if (is.element(el = word_1, set = colnames(x = sim_mat)) &
      is.element(el = word_2, set = colnames(x = sim_mat))
  ) {
    return(
      dplyr::data_frame(
        word_1 = word_1,
        word_2 = word_2,
        similarity = sim_mat[word_1, word_2]
      )
    )
  } else{
    return(
      dplyr::data_frame(
        word_1 = word_1,
        word_2 = word_2,
        similarity = 0
      )
    )
  }
}

extractWordVecSim <- function (
  target_words,
  word_sim_mat, word_sim_word
) {
  
  # The vocabulary here is small, so compute all cosine similarities at once
  wordvec_sim <- as.matrix(x = 
    filterCosineSim(
      seed_word_vector = word_sim_mat,
      target_word_vectors = NULL,
      extract_rownames = rownames(x = word_sim_mat)
    ) %>%
      replace(x = ., list = is.na(.), values = 0)
  )
  diag(x = wordvec_sim) <- 1

  # Use the precomputed similarity matrix to output each word pair's similarity
  return(
    dplyr::bind_rows(
      target_words %>%
        dplyr::rowwise(.) %>%
        dplyr::do(
          word2vec_sim = extractSimi(
            word_1 = .$word_1, word_2 = .$word_2,
            sim_mat = wordvec_sim
          )
        ) %>%
        .$word2vec_sim
      )
  )
}


# Read the word-pair gold data
read_wordsim <- readr::read_csv(
  file = SET_TASK_FILE_NAMES$SIMILARITY_EVAL, n_max = -1, skip = 1,
  col_names = c("word_1", "word_2", "similarity_score")
)
similarity_word <- unique(as.character(unlist(read_wordsim[, 1:2])))


# word2vec
# Some words in the similarity gold data are missing from the word2vec results, so drop them (word_sim_mat)
word2vec_sim_word <- rownames(wordvec_matrix)
word2vec_sim <- dplyr::left_join(
  x = read_wordsim,
  y = extractWordVecSim(
    target_words = read_wordsim,
    word_sim_mat =  wordvec_matrix[
      is.element(el = word2vec_sim_word, set = similarity_word),
    ],
    word_sim_word = word2vec_sim_word
  ),
  by = c("word_1" = "word_1", "word_2" = "word_2")
)
head(x = word2vec_sim, n = 15)
Source: local data frame [15 x 4]

       word_1        word_2 similarity_score similarity
1        love           sex             6.77  0.5684097
2       tiger           cat             7.35  0.8221492
3       tiger         tiger            10.00  1.0000000
4        book         paper             7.46  0.4952797
5    computer      keyboard             7.62  0.6370476
6    computer      internet             7.58  0.6634256
7       plane           car             5.77  0.4916155
8       train           car             6.31  0.5546425
9   telephone communication             7.50  0.5873272
10 television         radio             6.77  0.7543650
11      media         radio             7.42  0.5272615
12       drug         abuse             6.85  0.7414397
13      bread        butter             6.19  0.8939361
14   cucumber        potato             5.92  0.7951237
15     doctor         nurse             7.00  0.6825135
# LSI
# Some words in the similarity gold data are missing from the meaning vectors, so drop them (word_sim_mat)
wordsence_sim_word <- rownames(word_sence)
wordsence_sim <- dplyr::left_join(
  x = read_wordsim,
  y = extractWordVecSim(
    target_words = read_wordsim,
    word_sim_mat =  word_sence[
      is.element(el = wordsence_sim_word, set = similarity_word),
    ],
    word_sim_word = wordsence_sim_word
  ),
  by = c("word_1" = "word_1", "word_2" = "word_2")
)
head(x = wordsence_sim, n = 15)
Source: local data frame [15 x 4]

       word_1        word_2 similarity_score similarity
1        love           sex             6.77 0.06095775
2       tiger           cat             7.35 0.90669539
3       tiger         tiger            10.00 1.00000000
4        book         paper             7.46 0.37100028
5    computer      keyboard             7.62 0.03737423
6    computer      internet             7.58 0.16836935
7       plane           car             5.77 0.36781759
8       train           car             6.31 0.48966641
9   telephone communication             7.50 0.09927349
10 television         radio             6.77 0.66331449
11      media         radio             7.42 0.27871866
12       drug         abuse             6.85 0.94550875
13      bread        butter             6.19 0.62764941
14   cucumber        potato             5.92 0.00000000
15     doctor         nurse             7.00 0.04455693

95. Evaluation on WordSimilarity-353

Using the data created in task 94, compute the Spearman rank correlation between the similarity rankings output by each model and the ranking of human similarity judgments.

# word2vec
word2vec_sim %>%
  dplyr::select(similarity_score, similarity) %>%
  cor(method = "spearman")
                 similarity_score similarity
similarity_score        1.0000000  0.5608885
similarity              0.5608885  1.0000000
# LSI
wordsence_sim %>%
  dplyr::select(similarity_score, similarity) %>%
  cor(method = "spearman")
                 similarity_score similarity
similarity_score        1.0000000  0.1997545
similarity              0.1997545  1.0000000
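
Since Spearman's rank correlation is the Pearson correlation of the ranks, the word2vec value above can be cross-checked like this (a sanity check, not part of the task):

with(word2vec_sim, cor(x = rank(similarity_score), y = rank(similarity)))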

96. Extracting vectors for country names

From the word2vec training results, extract only the vectors for country names.

# Use {countrycode} to do the same thing as in task 81
coutry_name <- stringr::str_replace(
  string = countrycode_data$country.name[
    !is.na(countrycode_data$cowc) & !is.na(countrycode_data$region)
  ], 
  pattern = "\\(.*\\)", replacement = ""
)

replace_coutry_name <- stringr::str_replace_all(
  string = coutry_name, pattern = " ", replacement = "_"
)
names(replace_coutry_name) <- coutry_name

word2vec_content_word <- rownames(x = wordvec_matrix)
word2vec_country <- wordvec_matrix[
  is.element(el = word2vec_content_word, set = replace_coutry_name),
]

# Show the country names
head(x = rownames(word2vec_country), n = 20)
 [1] "United_States"  "France"         "India"          "Canada"        
 [5] "Germany"        "Australia"      "China"          "Japan"         
 [9] "Ireland"        "Italy"          "Israel"         "Mexico"        
[13] "New_Zealand"    "Spain"          "United_Kingdom" "Georgia"       
[17] "Poland"         "Sweden"         "South_Africa"   "Portugal"      

97. k-means clustering

Run k-means clustering with k = 5 on the word vectors from task 96.

SET_CLUSTER_PARAM <- list(
  CLUSTER_NUM = 5
)
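
Note that kmeans() starts from randomly chosen centers, so the cluster labels (and possibly the memberships) change from run to run; fixing the RNG state first (the seed value below is arbitrary) makes the result reproducible.

set.seed(seed = 71)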


# K-Means Clustering
kmean_res <- kmeans(x = word2vec_country, centers = SET_CLUSTER_PARAM$CLUSTER_NUM)
cluster_res <- dplyr::data_frame(
  kmean = as.integer(kmean_res$cluster),
  country = names(kmean_res$cluster)
)
cluster_res %>%
  dplyr::group_by(kmean) %>%
  tidyr::nest(country) %>%
  as.data.frame()
  kmean
1     1
2     2
3     3
4     4
5     5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 country
1 Jordan, Chad, Monaco, Seychelles, Honduras, Dominican_Republic, Somalia, Bahamas, Solomon_Islands, Papua_New_Guinea, Luxembourg, Tunisia, Paraguay, Yemen, Malawi, Zambia, Nicaragua, Kazakhstan, Liberia, Madagascar, Marshall_Islands, Qatar, Tonga, Mali, El_Salvador, Bosnia_and_Herzegovina, Namibia, Botswana, Mauritius, United_Arab_Emirates, Kuwait, Maldives, Bahrain, Sierra_Leone, Oman, Cameroon, Grenada, Belize, Kyrgyzstan, Nauru, Palau, Tajikistan, Uzbekistan, Russian_Federation, Bhutan, Barbados, Benin, Rwanda, Burkina_Faso, Timor-Leste, Myanmar, Gabon, Andorra, Mauritania, Trinidad_and_Tobago, Vanuatu, Guyana, Liechtenstein, Kiribati, Saint_Lucia, Dominica, Eritrea, Tuvalu, San_Marino, Djibouti, Guinea-Bissau, Suriname, Burundi, Lesotho, Turkmenistan, Gambia, Swaziland, Equatorial_Guinea, Togo, Viet_Nam, Central_African_Republic, Comoros, Antigua_and_Barbuda, South_Sudan
2                                                                                                                                                                                                                                                                                                                                                                                                                                                             Georgia, Malaysia, Ukraine, Nigeria, Morocco, Bangladesh, Peru, Albania, Sri_Lanka, Azerbaijan, Angola, Croatia, Malta, Guinea, Libya, Ghana, Lithuania, Kenya, Lebanon, Cyprus, Jamaica, Algeria, Iceland, Czech_Republic, Saudi_Arabia, Sudan, Ethiopia, Cambodia, Nepal, Uganda, Uruguay, Estonia, Zimbabwe, Slovakia, Haiti, Congo, Senegal, Samoa, Montenegro, Mongolia, Guatemala, Latvia, Mozambique, Costa_Rica, Ecuador, Niger, Slovenia, Belarus
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           United_States, Israel, Mexico, Pakistan, Philippines, Singapore, Iraq, Panama, Indonesia, Afghanistan, Cuba, Thailand, Colombia, Chile, Fiji
4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             France, Germany, Italy, Spain, Poland, Sweden, South_Africa, Portugal, Netherlands, Brazil, Greece, Egypt, Switzerland, Hungary, Austria, Norway, Denmark, Belgium, Romania, Turkey, Argentina, Finland, Bulgaria, Armenia
5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           India, Canada, Australia, China, Japan, Ireland, New_Zealand, United_Kingdom

98. Clustering with Ward's method

Run hierarchical clustering by Ward's method on the word vectors from task 96. Then visualize the clustering result as a dendrogram.

# Hierarchical Clustering
hclust_res <- hclust(d = dist(x = word2vec_country), method = "ward.D2")
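# Note: method = "ward.D2" applies Ward's criterion to the (unsquared) Euclidean
# distances from dist(); with the older "ward.D" you would need to square the
# distances yourself to obtain the classical Ward clustering.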
networkD3::dendroNetwork(
  hc = hclust_res, 
  width = 850, height = 3000, zoom = TRUE,
  linkType = "elbow", treeOrientation = "horizontal"
)

# Append the hierarchical clustering result
cluster_res <- dplyr::left_join(
  x = cluster_res,
  y = dplyr::data_frame(
    country = names(cutree(tree = hclust_res, k = SET_CLUSTER_PARAM$CLUSTER_NUM)),
    hclust = as.integer(cutree(tree = hclust_res, k = SET_CLUSTER_PARAM$CLUSTER_NUM))
  ),
  by = c("country")
)

# Cross-tabulate the kmeans and hclust assignments
# The methods seem to disagree around South America and the Middle East?
table(cluster_res$kmean, cluster_res$hclust)
   
     1  2  3  4  5
  1  0  0  0  1 78
  2  0  1  0 36 11
  3  4  1  0 10  0
  4  0 22  0  2  0
  5  3  0  5  0  0
SET_GEO_JSON <- "https://raw.githubusercontent.com/datasets/geo-boundaries-world-110m/master/countries.geojson"

# Fetch location data for plotting (this source matches more countries)
coutry_location <- dplyr::bind_rows(
  x = lapply(
    X = rownames(word2vec_country), 
    FUN = ggmap::geocode,
    messaging = FALSE, output = "more"
  )
) %>% 
  dplyr::select(country, lon, lat, address, query)
Warning: geocode failed with status ZERO_RESULTS, location = "Two_Sicilies"
Warning: geocode failed with status ZERO_RESULTS, location =
"German_Democratic_Republic"
Warning in rbind_all(x): Unequal factor levels: coercing to character
Warning in rbind_all(x): Unequal factor levels: coercing to character
Warning in rbind_all(x): Unequal factor levels: coercing to character
# Fetch polygon data for each country
download.file(
  url = SET_GEO_JSON, 
  destfile = basename(path = SET_GEO_JSON)
)
countries_geo <- rgdal::readOGR(dsn = basename(path = SET_GEO_JSON), layer = "OGRGeoJSON")
OGR data source with driver: GeoJSON 
Source: "countries.geojson", layer: "OGRGeoJSON"
with 177 features
It has 63 fields
countries_geo$admin <- stringr::str_to_lower(string = as.character(countries_geo$admin))
# Attach the clustering result to the polygon data and draw a choropleth colored by cluster
plotClusterChoropleth <- function (
  coutry_polygons, coutry_cluster, cluster_num,
  plot_param = list(
    width = 900, height = 600,
    smooth_factor = 0.2, fill_opacity = 0.5
  )
) {
  
  coutry_polygons$cluster <- factor(x = coutry_cluster)
  cluster_factor_pal <- leaflet::colorFactor(
    palette = topo.colors(n = cluster_num),
    domain = seq(from = 1, to = cluster_num)
  )

  coutry_polygons %>% 
    leaflet::leaflet(
      data = ., 
      width = plot_param$width, height = plot_param$height
    ) %>%
    leaflet::setView(lng = 60, lat = 0, zoom = 1) %>%
    leaflet::addTiles() %>% 
    leaflet::addPolygons(
      stroke = FALSE, 
      smoothFactor = plot_param$smooth_factor, fillOpacity = plot_param$fill_opacity,
      color = ~cluster_factor_pal(cluster)
    )
}


# Color the markers by the clustering result of the country name vectors
color_pal <- leaflet::colorFactor(
  palette = topo.colors(n = SET_CLUSTER_PARAM$CLUSTER_NUM),
  domain = seq(from = 1, to = SET_CLUSTER_PARAM$CLUSTER_NUM)
)
cluster_loc <- dplyr::left_join(
  x = cluster_res,
  y = coutry_location %>% 
    dplyr::select(-country),
  by = c("country" = "query")
) %>%
 na.omit(.)


# kmeans
cluster_loc %>%
  leaflet::leaflet(data = ., width = 900, height = 600) %>% 
  leaflet::setView(lng = 60, lat = 0, zoom = 1) %>%
  leaflet::addTiles() %>% 
  leaflet::addCircleMarkers(
    lng = ~lon, lat = ~lat,
    radius = 10, stroke = TRUE,
    color = ~color_pal(kmean), group = ~kmean
  )

# hclust
cluster_loc %>%
  leaflet::leaflet(data = ., width = 900, height = 600) %>% 
  leaflet::setView(lng = 60, lat = 0, zoom = 1) %>%
  leaflet::addTiles() %>% 
  leaflet::addCircleMarkers(
    lng = ~lon, lat = ~lat,
    radius = 10, stroke = TRUE,
    color = ~color_pal(hclust), group = ~hclust
  )

# With kmeans, Japan lands in the same cluster as Australia, Canada, and New Zealand;
# with hclust, Japan gets a different cluster while Australia, Canada, and New Zealand stay together


# Choropleth map of the per-country clustering results
mapplot_cluster <- cluster_res %>%
  dplyr::mutate(
    country = stringr::str_to_lower(
      string = stringr::str_replace_all(
        string = .$country, pattern = "_", replacement = " "
      )
    )
  ) %>%
  dplyr::filter(is.element(set = countries_geo$admin, el = .$country))

# Number of countries where the polygon data and the clustering result could not be matched by name
sum(!is.element(el = countries_geo$admin, set = mapplot_cluster$country))
[1] 33
# Keep only countries that match in both the polygon data and the clustering result
countries_geo <- countries_geo[
  is.element(el = countries_geo$admin, set = mapplot_cluster$country),
]
# Now everything matches
sum(!is.element(el = countries_geo$admin, set = mapplot_cluster$country))
[1] 0
# Sort so that the polygon data and the clustering result line up row by row
countries_geo <- countries_geo[order(countries_geo$admin), ]


# kmeans
plotClusterChoropleth(
  coutry_polygons = countries_geo,
  coutry_cluster = mapplot_cluster$kmean[order(mapplot_cluster$country)], 
  cluster_num = SET_CLUSTER_PARAM$CLUSTER_NUM,
  plot_param = list(
    width = 900, height = 600,
    smooth_factor = 0.2, fill_opacity = 0.5
  )
)

# hclust
plotClusterChoropleth(
  coutry_polygons = countries_geo,
  coutry_cluster = mapplot_cluster$hclust[order(mapplot_cluster$country)], 
  cluster_num = SET_CLUSTER_PARAM$CLUSTER_NUM,
  plot_param = list(
    width = 900, height = 600,
    smooth_factor = 0.2, fill_opacity = 0.5
  )
)


99. Visualization with t-SNE

Visualize the vector space of the word vectors from task 96 with t-SNE.

# Reduce to three dimensions with t-SNE for plotting
word2vec_country_tsne <- tsne::tsne(
  X = word2vec_country, k = 3, 
  max_iter = 1000, epoch = 1001
) 
sigma summary: Min. : 0.2712 |1st Qu. : 0.469 |Median : 0.5136 |Mean : 0.5157 |3rd Qu. : 0.5573 |Max. : 0.6918 |
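# tsne::tsne() also starts from a random initial layout (and uses its default
# perplexity of 30 here), so call set.seed() beforehand if the embedding
# needs to be reproducible.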
colnames(x = word2vec_country_tsne) <- c("x", "y", "z")
cluster_res <- dplyr::bind_cols(
  cluster_res, 
  data.frame(word2vec_country_tsne)
) %>% 
  print
Source: local data frame [174 x 6]

   kmean       country hclust           x          y          z
1      3 United_States      1 -12.7453644 -10.565626  11.654089
2      4        France      2  -7.0296882 -19.211345   3.459901
3      5         India      1   4.3163778  -7.167451  23.182763
4      5        Canada      3   9.0122075  12.968094  19.410686
5      4       Germany      2 -10.3005859 -12.165826 -16.071274
6      5     Australia      3 -16.0925092  25.774982   1.387155
7      5         China      1   0.9276603 -15.911632 -17.921678
8      5         Japan      1 -20.1800474   2.090080 -13.681249
9      5       Ireland      3  14.4176810 -20.754370 -15.478812
10     4         Italy      2  -5.0062490  24.958916   3.814722
..   ...           ...    ...         ...        ...        ...
# Color: the kmeans clustering result
# (ideally shape would encode the hclust result, but that is not implemented yet)
# Position: the t-SNE dimensionality reduction
threejs::scatterplot3js(
  x = cluster_res$x, y = cluster_res$y, z = cluster_res$z,
  axis = FALSE, grid = FALSE,
  color = rainbow(n = length(unique(cluster_res$kmean)))[cluster_res$kmean],
  # Not yet used
  pch = cluster_res$hclust,
  labels = cluster_res$country,
  width = 900, height = 600, size = 1.00
)



Thoughts



Execution environment

library(devtools)
devtools::session_info()
Session info --------------------------------------------------------------
 setting  value                       
 version  R version 3.2.2 (2015-08-14)
 system   x86_64, darwin13.4.0        
 ui       X11                         
 language (EN)                        
 collate  ja_JP.UTF-8                 
 tz       Asia/Tokyo                  
Packages ------------------------------------------------------------------
 package     * version     date      
 assertthat  * 0.1         2013-12-06
 base64enc     0.1-3       2015-07-28
 colorspace    1.2-6       2015-03-11
 countrycode * 0.18        2014-12-29
 crayon        1.3.0       2015-06-05
 curl          0.9         2015-06-19
 DBI           0.3.1       2014-09-24
 devtools    * 1.8.0       2015-05-09
 digest        0.6.8       2014-12-31
 dplyr       * 0.4.2.9002  2015-07-25
 evaluate      0.7         2015-04-21
 formatR       1.2         2015-04-21
 geosphere     1.3-13      2015-04-11
 ggmap       * 2.4         2015-03-05
 ggplot2     * 1.0.1       2015-03-17
 git2r         0.10.1      2015-05-07
 gtable        0.1.2       2012-12-05
 hadleyverse * 0.1         2015-08-09
 haven       * 0.2.0       2015-04-09
 htmltools     0.2.6       2014-09-08
 htmlwidgets   0.5.1       2015-07-25
 jpeg          0.1-8       2014-01-23
 jsonlite      0.9.16      2015-04-11
 knitr       * 1.10.5      2015-05-06
 lattice       0.20-33     2015-07-14
 lazyeval    * 0.1.10.9000 2015-07-25
 leaflet     * 1.0.0.9999  2015-07-25
 lubridate   * 1.3.3       2013-12-31
 magrittr      1.5         2014-11-22
 mapproj       1.2-2       2014-01-16
 maps          2.3-9       2014-09-22
 MASS          7.3-43      2015-07-16
 Matrix      * 1.2-2       2015-07-08
 memoise       0.2.1       2014-04-22
 munsell       0.4.2       2013-07-11
 networkD3   * 0.2.1       2015-08-25
 plyr        * 1.8.3       2015-06-12
 png           0.1-7       2013-12-03
 proto         0.3-10      2012-12-22
 R6            2.1.1       2015-08-19
 Rcpp          0.12.0      2015-07-26
 readr       * 0.1.1.9000  2015-07-25
 readxl      * 0.1.0       2015-04-14
 reshape2      1.4.1       2014-12-06
 rgdal       * 1.0-4       2015-06-23
 RgoogleMaps   1.2.0.7     2015-01-21
 rjson         0.2.15      2014-11-03
 RJSONIO       1.3-0       2014-07-28
 rmarkdown     0.7         2015-06-13
 rstudioapi    0.3.1       2015-04-07
 rversions     1.0.1       2015-06-06
 scales        0.2.5       2015-06-12
 sp          * 1.1-1       2015-06-05
 stringi     * 0.5-5       2015-06-29
 stringr     * 1.0.0.9000  2015-07-25
 testthat    * 0.10.0      2015-05-22
 threejs       0.2.2       2015-10-13
 tidyr       * 0.2.0.9000  2015-07-25
 tsne        * 0.1-2       2012-05-02
 xml2        * 0.1.1       2015-06-02
 yaml          2.1.13      2014-06-12
 source                               
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.1)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hadley/dplyr@75e8303)        
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (aaboyles/hadleyverse@16532fe)
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (ramnathv/htmlwidgets@e153784)
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.2)                       
 Github (hadley/lazyeval@ecb8dc0)     
 Github (rstudio/leaflet@451fc78)     
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.2)                       
 CRAN (R 3.2.2)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (RcppCore/Rcpp@6ae91cc)       
 Github (hadley/readr@f4a3956)        
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 Github (hadley/stringr@380c88f)      
 CRAN (R 3.2.0)                       
 Github (bwlewis/rthreejs@6e186f4)    
 Github (hadley/tidyr@0dc87b2)        
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)                       
 CRAN (R 3.2.0)