Lecture 5: 類似度計算

前回の課題

課題１：getFreqMtxDir()関数の作成

ディレクトリ名，頻度タイプ（オプション引数）の２つの変数を引数とし、オプション引数値により素頻度行列、相対頻度行列を返す関数を作成しなさい。

ただし、オプション引数のデフォルトは、素頻度行列が結果として出力されること。

ポイント

getFreqMtxDir()関数の中で、getRawFreqMtx2()を呼び込む

getFreqMtxDir.R

getFreqMtxDir <- function(dirname, rela = FALSE) {

    files <- list.files(dirname)
    filesDir <- unlist(lapply(dirname, paste, files, sep = "/"))

    source("getFreqMtx2.R")

    # freqLst <- lapply(filesDir, function(x) getFreqMtx2(x, relative))
    freqLst <- lapply(filesDir, getFreqMtx2, relative = rela)

    mtx <- freqLst[[1]]
    for (i in freqLst[-1]) mtx <- merge(mtx, i, all = T, by = "term")

    mtx[is.na(mtx)] <- 0
    mtx <- mtx[order(as.vector(mtx$term)), ]

    row.names(mtx) <- mtx[, 1]
    mtx <- mtx[-1]

    return(mtx)
}

補足:第２引数の適用

freqLst <- lapply(filesDir, function(x) getFreqMtx2(x, relative))

freqLst <- lapply(filesDir, getFreqMtx2, relative = rela)

実行例１

source("getFreqMtxDir.R")
res <- getFreqMtxDir("../univ")
res[1:5, ]

##     hiroshima kufs kyoto osaka osakaNew tokyo waseda
## 000         0    0     0     0        0     0      2
## 1           0    0     0     1        0     0      1
## 10          0    0     1     0        0     0      1
## 100         0    0     0     0        1     0      0
## 11          2    0     0     0        0     0      0

実行例２

source("getFreqMtxDir.R")
res <- getFreqMtxDir("../univ", rela = TRUE)
res[1:5, ]

##     hiroshima kufs kyoto osaka osakaNew tokyo waseda
## 000     0.000    0 0.000 0.000    0.000     0  0.002
## 1       0.000    0 0.000 0.003    0.000     0  0.001
## 10      0.000    0 0.001 0.000    0.000     0  0.001
## 100     0.000    0 0.000 0.000    0.002     0  0.000
## 11      0.011    0 0.000 0.000    0.000     0  0.000

課題２：実データでの関数の実行

univフォルダに含まれる6つの大学の挨拶文に関して、課題１で作成した関数を使用して、粗頻度行列を作成して、csv形式で出力しなさい。ただし、粗頻度は、行の頻度総数が50以上の単語に限定すること。

実行例

source("getFreqMtxDir.R")
res <- getFreqMtxDir("../univ")
res1 <- res[rowSums(res) >= 50, ]
res1

##            hiroshima kufs kyoto osaka osakaNew tokyo waseda
## a                  4   12    21     8       14    10     24
## and                7   18    38    15       26    22     37
## in                 4   20    26     4       11     4     32
## of                 5   18    35    10       26    38     34
## the                7   25    36    16       31    52     39
## to                11   17    26    11       17    15     35
## university         6    5    16    18       21     9     22

apply関数

apply(data, margin, function): 行、列の操作をまとめて行う関数

 行操作 ex. rowSums(x) == apply(x, 1, sum)
 列操作 ex. colSums(x) == apply(x, 2, sum)
 各要素の平方根＝apply(x,c(1,2),sqrt)

サンプル行列の作成

tmp <- matrix(1:6, nrow = 2, ncol = 3)

行操作

rowSums(tmp)

## [1]  9 12

apply(tmp, 1, sum)

## [1]  9 12

列操作

colSums(tmp)

## [1]  3  7 11

apply(tmp, 2, sum)

## [1]  3  7 11

要素操作

apply(tmp, c(1, 2), sqrt)

##       [,1]  [,2]  [,3]
## [1,] 1.000 1.732 2.236
## [2,] 1.414 2.000 2.449

apply(tmp, c(1, 2), function(x) x^2)

##      [,1] [,2] [,3]
## [1,]    1    9   25
## [2,]    4   16   36

apply(tmp, 2, function(tmp) tmp/sum(tmp))

##        [,1]   [,2]   [,3]
## [1,] 0.3333 0.4286 0.4545
## [2,] 0.6667 0.5714 0.5455

getFreqMtxDir.Rの応用例

getRawFreqMtx2()のrelative引数を使用しないで、素頻度行列からまとめて計算する方法

getFreqMtxDir.R

getFreqMtxDir <- function(dirname, relative=FALSE)
{

  files <- list.files(dirname)
  filesDir <- unlist(lapply(dirname, paste, files, sep = "/"))

  source("getFreqMtx2.R")
  freqLst <- lapply(filesDir, getFreqMtx2)

  (中略)

  mtx <- mtx[-1]

  if (relative==TRUE){
    mtx <- apply(mtx,2,function(mtx) round(mtx/sum(mtx))
  }

  return(mtx)
}

頻度数の重み付け

Term Frequency-Inverse Document Frequency

複数のテキストに共通して出現する単語の低く評価

TF-IDF 1

\[ w=tf*log(\frac{N}{df}) \]

tf: term frequency

df: document frequency

tf値（頻度数）を計算

tf <- getFreqMtxDir("../testData")
tf

##   test1 test2 test3
## a     3     2     2
## b     4     4     0
## c    13     2     3
## d     0     0     1
## e     7     1     1
## f     0    11     9
## g     0     7     7
## h     0     0     4

N値（ドキュメント数）を計算

N <- ncol(tf)
N

## [1] 3

df値（１行あたり０以上の要素数）を計算

df <- apply(tf, 1, function(x) length(x[x > 0]))
df

## a b c d e f g h 
## 3 2 3 1 3 2 2 1

tf-idf値を計算

w <- tf * log(N/df)
w

##   test1 test2 test3
## a 0.000 0.000 0.000
## b 1.622 1.622 0.000
## c 0.000 0.000 0.000
## d 0.000 0.000 1.099
## e 0.000 0.000 0.000
## f 0.000 4.460 3.649
## g 0.000 2.838 2.838
## h 0.000 0.000 4.394

TF-IDF 2

共通出現頻度も考慮したtf-idf式

\[ w=tf*(log(\frac{N}{df})+1) \]

tf-idf値を計算

w <- tf * (log(N/df) + 1)
w

##    test1  test2  test3
## a  3.000  2.000  2.000
## b  5.622  5.622  0.000
## c 13.000  2.000  3.000
## d  0.000  0.000  2.099
## e  7.000  1.000  1.000
## f  0.000 15.460 12.649
## g  0.000  9.838  9.838
## h  0.000  0.000  8.394

getFreqMtxDirを拡張

tf-idfの計算用の関数を作成：calcTFIDF

calcTFIDF <- function(tf, type = 1) {

    N <- ncol(tf)

    idf <- apply(tf, 1, function(x) length(x[x > 0]))

    if (type == 1) {
        w <- tf * log(N/idf)
    } else if (type == 2) {
        w <- tf * (log(N/idf) + 1)
    }

    return(w)
}

tf-idfのオプションを追加

getFreqMtxDir <- function(dirname, rela = FALSE, tfidf = 0) {

    files <- list.files(dirname)
    filesDir <- unlist(lapply(dirname, paste, files, sep = "/"))

    source("getFreqMtx2.R")
    # freqLst <- lapply(filesDir, function(x) getFreqMtx2(x, relative))
    # freqLst <- lapply(filesDir, getFreqMtx2, relative=rela)
    freqLst <- lapply(filesDir, getFreqMtx2)

    mtx <- freqLst[[1]]
    for (i in freqLst[-1]) mtx <- merge(mtx, i, all = T, by = "term")

    mtx[is.na(mtx)] <- 0
    mtx <- mtx[order(as.vector(mtx$term)), ]

    row.names(mtx) <- mtx[, 1]
    mtx <- mtx[-1]

    if (rela == TRUE) {
        mtx <- apply(mtx, 2, function(mtx) round(mtx/sum(mtx), 3))
    }

    if (tfidf == 1 || tfidf == 2) {
        mtx <- calcTFIDF(mtx, tfidf)
    }
    return(mtx)
}

実行例

source("getFreqMtxDir.R")
res <- getFreqMtxDir("../testData", tfidf = 1)
round(res, 2)

##   test1 test2 test3
## a  0.00  0.00  0.00
## b  1.62  1.62  0.00
## c  0.00  0.00  0.00
## d  0.00  0.00  1.10
## e  0.00  0.00  0.00
## f  0.00  4.46  3.65
## g  0.00  2.84  2.84
## h  0.00  0.00  4.39

相関係数

ピアソン積率相関係数

\[ Corr(x,y)= \frac{\sum (x_{i}-\overline{x}) (y_{i}-\overline{y})}{\sqrt{\sum (x_{i}-\overline{x})^2\sum (y_{i}-\overline{y})^2}} \]

相関係数行列（テキスト間）

tf <- getFreqMtxDir("../testData")
cor(tf)

##         test1   test2   test3
## test1  1.0000 -0.2876 -0.3798
## test2 -0.2876  1.0000  0.7967
## test3 -0.3798  0.7967  1.0000

相関係数行列（単語間）

行と列を転置する

t(tf)

##       a b  c d e  f g h
## test1 3 4 13 0 7  0 0 0
## test2 2 4  2 0 1 11 7 0
## test3 2 0  3 1 1  9 7 4

round(cor(t(tf)), 2)

##       a     b     c     d     e     f     g     h
## a  1.00  0.50  1.00 -0.50  1.00 -0.99 -1.00 -0.50
## b  0.50  1.00  0.43 -1.00  0.50 -0.34 -0.50 -1.00
## c  1.00  0.43  1.00 -0.43  1.00 -1.00 -1.00 -0.43
## d -0.50 -1.00 -0.43  1.00 -0.50  0.34  0.50  1.00
## e  1.00  0.50  1.00 -0.50  1.00 -0.99 -1.00 -0.50
## f -0.99 -0.34 -1.00  0.34 -0.99  1.00  0.99  0.34
## g -1.00 -0.50 -1.00  0.50 -1.00  0.99  1.00  0.50
## h -0.50 -1.00 -0.43  1.00 -0.50  0.34  0.50  1.00

# res1

散布図

plot(tf[, 1], tf[, 2], type = "n")
text(tf[, 1], tf[, 2], rownames(tf))
cor(tf[, 1], tf[, 2])

## [1] -0.2876

mtext(paste("corr = ", round(cor(tf[, 1], tf[, 2]), 2)), side = 3)

plot of chunk unnamed-chunk-22

おまけ

library(manipulate)

manipulate({
    plot(tf[, x], tf[, y], type = "n")
    text(tf[, x], tf[, y], rownames(tf))
    mtext(paste("corr = ", round(cor(tf[, x], tf[, y]), 2)), side = 3)
}, x = picker(1, 2, 3), y = picker(1, 2, 3))

共分散係数

\[ Cov(x,y)= \frac{\sum (x_{i}-\overline{x}) (y_{i}-\overline{y})}{n} \]

相関係数行列（テキスト間）

tf <- getFreqMtxDir("../testData")
cov(tf)

##        test1  test2  test3
## test1 21.696 -5.161 -5.589
## test2 -5.161 14.839  9.696
## test3 -5.589  9.696  9.982

proxyパッケージによる類似度計算

proxyの読み込み

library(proxy)

## Attaching package: 'proxy'

## The following object(s) are masked from 'package:stats':
## 
## as.dist, dist

変数間の相関行列

行と列を転置する

simil(t(tf))

##         test1   test2
## test2 -0.2876        
## test3 -0.3798  0.7967

変数間の相関行列（対角行列で結果表示）

simil(t(tf), diag = T)

##         test1   test2   test3
## test1      NA                
## test2 -0.2876      NA        
## test3 -0.3798  0.7967      NA

単語間の相関行列

simil(tf)

##        a      b      c      d      e      f      g
## b 0.8397                                          
## c 0.7066 0.5975                                   
## d 0.8254 0.7392 0.5320                            
## e 0.8301 0.7951 0.7418 0.7902                     
## f 0.3911 0.3520 0.1717 0.3704 0.2212              
## g 0.5864 0.5473 0.3670 0.5657 0.4165 0.8047       
## h 0.7884 0.6281 0.5690 0.8889 0.6791 0.4815 0.6768

テキスト間のコサイン類似度

\[ Cos(x,y)= \frac{\sum x_{i} y_{i}}{\sqrt{\sum x_{i}^2\sum y_{i}^2}} \]

simil(t(tf), method = "cosine")

##        test1  test2
## test2 0.2527       
## test3 0.2629 0.8974

行列の条件抽出

頻度行列取得

source("getFreqMtxDir.R")
res <- getFreqMtxDir("../univ")
res[1:3, ]

##     hiroshima kufs kyoto osaka osakaNew tokyo waseda
## 000         0    0     0     0        0     0      2
## 1           0    0     0     1        0     0      1
## 10          0    0     1     0        0     0      1

“000"の行を抽出

res[rownames(res) == "000", ]

##     hiroshima kufs kyoto osaka osakaNew tokyo waseda
## 000         0    0     0     0        0     0      2

"000"と"10"の行を抽出

res[(rownames(res) == "000") | (rownames(res) == "10"), ]

##     hiroshima kufs kyoto osaka osakaNew tokyo waseda
## 000         0    0     0     0        0     0      2
## 10          0    0     1     0        0     0      1

res[rownames(res) %in% c("000", "10"), ]

##     hiroshima kufs kyoto osaka osakaNew tokyo waseda
## 000         0    0     0     0        0     0      2
## 10          0    0     1     0        0     0      1

特定リストに含まれる行を抽出

noise <- c("a", "an", "the")
res[rownames(res) %in% noise, ]

##     hiroshima kufs kyoto osaka osakaNew tokyo waseda
## a           4   12    21     8       14    10     24
## an          0    1     7     2        1     0      5
## the         7   25    36    16       31    52     39

特定リストに含まれない行を抽出

res[!rownames(res) %in% noise, ]

grep & 正規表現を使用

最初の文字が数字で始まるものを抽出

tmp <- res[grep(rownames(res), pattern = "^[[:digit:]]"), ]
rownames(tmp)

##  [1] "000"   "1"     "10"    "100"   "11"    "12"    "125th" "1893" 
##  [9] "1897"  "1931"  "2"     "2004"  "2007"  "2010"  "2012"  "2013" 
## [17] "21"    "21st"  "22nd"  "252"   "3"     "4"     "5"     "560"  
## [25] "600"   "7"     "700"

res[rownames(res) %in% rownames(tmp), ]

##       hiroshima kufs kyoto osaka osakaNew tokyo waseda
## 000           0    0     0     0        0     0      2
## 1             0    0     0     1        0     0      1
## 10            0    0     1     0        0     0      1
## 100           0    0     0     0        1     0      0
## 11            2    0     0     0        0     0      0
## 12            1    0     0     0        0     0      0
## 125th         0    0     0     0        0     0      1
## 1893          0    0     0     0        0     0      1
## 1897          0    0     1     0        0     0      0
## 1931          0    0     0     0        1     0      0
## 2             0    0     0     1        0     0      0
## 2004          0    0     1     0        1     0      0
## 2007          0    0     0     0        1     0      1
## 2010          0    0     0     0        0     0      1
## 2012          0    0     0     0        0     0      1
## 2013          0    0     0     0        0     0      1
## 21            0    0     0     0        0     0      1
## 21st          0    0     0     0        1     0      0
## 22nd          0    0     0     0        1     0      0
## 252           1    0     0     0        0     0      0
## 3             0    0     0     1        0     0      0
## 4             0    0     0     1        0     0      1
## 5             0    0     0     1        0     0      0
## 560           0    0     0     0        0     0      1
## 600           0    0     0     0        0     0      1
## 7             0    0     1     0        0     0      0
## 700           0    0     0     0        0     0      1

最初の文字が数字で始まるものを排除

res[!rownames(res) %in% rownames(tmp), ]

最初の文字が数字で始まるものを排除（その２）

tmp <- res[grep(rownames(res), pattern = "^[[:alpha:]]"), ]
rownames(tmp)
res[rownames(res) %in% rownames(tmp), ]

今日の課題(締め切り11月26日)

以下４つの引数をもち、テキストの類似行列を結果出力するcalcSmi関数を作成しなさい。

第1引数[ディレクトリ名]：dirname（必須）

第2引数[頻度タイプ]：relative（オプション, 初期値=FALSE）

relative=FALSE (raw frequency)
relative=TRUE (relative frequency)

第3引数[TF-IDFの重み]：tfidf（オプション, 初期値=0）

tfidf=0 (なにもしない)
tfidf=1 (tf-idf1)
tfidf=2 (tf-idf2)

第4引数

類似度計算法：simi（オプション, 初期値=”corr”）

数値は２桁で表示すること

simi=”corr” (相関係数)
simi=”cov” (共分散係数)
simi=”cosine” (コサイン類似度)

Lecture 5: 類似度計算

前回の課題

課題１：getFreqMtxDir()関数の作成

ディレクトリ名，頻度タイプ（オプション引数）の２つの変数を引数とし、オプション引数値により素頻度行列、相対頻度行列を返す関数を作成しなさい。

ポイント

getFreqMtxDir.R

補足:第２引数の適用

実行例１

実行例２

課題２：実データでの関数の実行

univフォルダに含まれる6つの大学の挨拶文に関して、課題１で作成した関数を使用して、粗頻度行列を作成して、csv形式で出力しなさい。ただし、粗頻度は、行の頻度総数が50以上の単語に限定すること。

実行例

apply関数

apply(data, margin, function): 行、列の操作をまとめて行う関数

サンプル行列の作成

行操作

列操作

要素操作

getFreqMtxDir.Rの応用例

getRawFreqMtx2()のrelative引数を使用しないで、素頻度行列からまとめて計算する方法

getFreqMtxDir.R

頻度数の重み付け

Term Frequency-Inverse Document Frequency

TF-IDF 1

tf: term frequency

df: document frequency

tf値（頻度数）を計算

N値（ドキュメント数）を計算

df値（１行あたり０以上の要素数）を計算

tf-idf値を計算

TF-IDF 2

共通出現頻度も考慮したtf-idf式

tf-idf値を計算

getFreqMtxDirを拡張

tf-idfの計算用の関数を作成：calcTFIDF

tf-idfのオプションを追加

実行例

相関係数

ピアソン積率相関係数

相関係数行列（テキスト間）

相関係数行列（単語間）

散布図

おまけ

共分散係数

共分散係数

相関係数行列（テキスト間）

proxyパッケージによる類似度計算

proxyの読み込み

変数間の相関行列

変数間の相関行列（対角行列で結果表示）

単語間の相関行列

テキスト間のコサイン類似度

行列の条件抽出

頻度行列取得

“000"の行を抽出

"000"と"10"の行を抽出

特定リストに含まれる行を抽出

特定リストに含まれない行を抽出

grep & 正規表現を使用

最初の文字が数字で始まるものを抽出

最初の文字が数字で始まるものを排除

最初の文字が数字で始まるものを排除（その２）

今日の課題(締め切り11月26日)

以下４つの引数をもち、テキストの類似行列を結果出力するcalcSmi関数を作成しなさい。

第1引数[ディレクトリ名]：dirname（必須）

第2引数[頻度タイプ]：relative（オプション, 初期値=FALSE）

第3引数[TF-IDFの重み]：tfidf（オプション, 初期値=0）

第4引数

類似度計算法：simi（オプション, 初期値=”corr”）

数値は２桁で表示すること

また作成した関数を利用して、 univディレクトリ内のテキストについて、以下の計算をして、結果をcsvファイルに出力しなさい。

1) 単語の素頻度行列から、コサイン類似度を計算

2) 単語の素頻度行列をTF-IDF1値で重みづけした行列から、コサイン類似度を計算

提出物は、calcSmiのRスクリプトと計算結果のcsvファイルです。