Lecture6: TF-IDF, 類似度計算

getFreq2015関数に変更：文字化け対処

  wordLst <- strsplit(txt, "[[:space:]]|[[:punct:]]")

の部分を以下に変更

  wordLst <- strsplit(txt, "[^a-zA-Z0-9]")

getFreqDir.Rを読み込む

source("getFreqDir.R")

★for文：繰り返し

tmp <- seq(1:10)
tmp

##  [1]  1  2  3  4  5  6  7  8  9 10

tmpの各要素を十倍する

tmp*10

##  [1]  10  20  30  40  50  60  70  80  90 100

tmpの各要素を十倍する

tmp2<-c()
for(i in 1:10) tmp2[i]<-tmp[i]*10
tmp2

##  [1]  10  20  30  40  50  60  70  80  90 100

tmpの各要素を十倍する

tmp3<-lapply(tmp, function(x) x*10)
unlist(tmp3)

##  [1]  10  20  30  40  50  60  70  80  90 100

## ★applyの説明

#### 行列の作成

r mtx<-matrix(1:6, nrow=2, ncol=3)

#### 行ごとに数を足し算する

r apply(mtx,1, sum)

## [1] 9 12

#### 列ごとに数を足し算する

r apply(mtx,2, sum)

## [1] 3 7 11

#### 要素ごとに数を平方根を計算する

r apply(mtx,c(1,2), sqrt)

## [,1] [,2] [,3] ## [1,] 1.000000 1.732051 2.236068 ## [2,] 1.414214 2.000000 2.449490

#### 補足：for文で行ごとに数を足し算する

r for(i in 1:nrow(mtx)){ print(sum(mtx[i,])) }

## [1] 9 ## [1] 12

★頻度数の重み付け

Term Frequency-Inverse Document Frequency

複数のテキストに共通して出現する単語の低く評価 ### TF-IDF 1 \[w=tf*log(\frac{N}{df}) \]

tf: term frequency

df: document frequency

tf値（頻度数）を計算

  tf<-getFreqDir("testData")
  tf

##   test1 test2 test3
## a     3     2     2
## b     4     4     0
## c    13     2     3
## d     0     0     1
## e     7     1     1
## f     0    11     9
## g     0     7     7
## h     0     0     4

N値（ドキュメント数）を計算

  N<-ncol(tf)
  N

## [1] 3

df値（１行あたり０以上の要素数）を計算

  df<-apply(tf, 1, function(x) length(x[x>0]) )
  df

## a b c d e f g h 
## 3 2 3 1 3 2 2 1

tf-idf値を計算

  w<-round(tf*log(N/df),2)
  w

##   test1 test2 test3
## a  0.00  0.00  0.00
## b  1.62  1.62  0.00
## c  0.00  0.00  0.00
## d  0.00  0.00  1.10
## e  0.00  0.00  0.00
## f  0.00  4.46  3.65
## g  0.00  2.84  2.84
## h  0.00  0.00  4.39

TF-IDF 2

共通出現頻度も考慮したtf-idf式

\[w=tf*(log(\frac{N}{df})+1) \]

tf-idf値を計算

  w<-round(tf*(log(N/df)+1),2)
  w

##   test1 test2 test3
## a  3.00  2.00  2.00
## b  5.62  5.62  0.00
## c 13.00  2.00  3.00
## d  0.00  0.00  2.10
## e  7.00  1.00  1.00
## f  0.00 15.46 12.65
## g  0.00  9.84  9.84
## h  0.00  0.00  8.39

getFreqMtxDirを拡張

tf-idfの計算用の関数を作成：calcTFIDF

  calcTFIDF<-function(tf, type=1){
  
    N<-ncol(tf)
  
    idf<-apply(tf, 1, function(x) length(x[x>0]) )
  
    if(type==1) {
      w<-tf*log(N/idf)
    }else if(type==2) {
      w<-tf*(log(N/idf)+1)
    }
  
    return(w)
  }

実習1: getFreqDir関数にcalcTFIDF関数を組み込む

手順1：getFreqDir.RのgetFreqDir関数の下に上記のcalcTFIDF関数を追加

手順2：getFreqDir関数の引数にオプション引数tfidfを追加（デフォルト値0)

手順3：getFreqDir関数内にtfidfのオプションで計算を異なるようにコードを追加

tfidf=0: 何もしない

tfidf=1: 上記のTF-IDF 1の計算式を適用

tfidf=2: 上記のTF-IDF 2の計算式を適用

実行例

  source("getFreqDir.R")
  res <- getFreqDir("testData")
  round(res,2)

##   test1 test2 test3
## a     3     2     2
## b     4     4     0
## c    13     2     3
## d     0     0     1
## e     7     1     1
## f     0    11     9
## g     0     7     7
## h     0     0     4

  res1 <- getFreqDir("testData",tfidf=1)
  round(res1,2)

##   test1 test2 test3
## a  0.00  0.00  0.00
## b  1.62  1.62  0.00
## c  0.00  0.00  0.00
## d  0.00  0.00  1.10
## e  0.00  0.00  0.00
## f  0.00  4.46  3.65
## g  0.00  2.84  2.84
## h  0.00  0.00  4.39

  res2 <- getFreqDir("testData",tfidf=2)
  round(res2,2)

##   test1 test2 test3
## a  3.00  2.00  2.00
## b  5.62  5.62  0.00
## c 13.00  2.00  3.00
## d  0.00  0.00  2.10
## e  7.00  1.00  1.00
## f  0.00 15.46 12.65
## g  0.00  9.84  9.84
## h  0.00  0.00  8.39

★相関係数

ピアソン積率相関係数

\[Corr(x,y)= \frac{\sum (x_{i}-\overline{x}) (y_{i}-\overline{y})}{\sqrt{\sum (x_{i}-\overline{x})^2\sum (y_{i}-\overline{y})^2}} \] #### 相関係数行列（テキスト間）

  tf <- getFreqDir("testData")
  res <-cor(tf)
  round(res,2)

##       test1 test2 test3
## test1  1.00 -0.29 -0.38
## test2 -0.29  1.00  0.80
## test3 -0.38  0.80  1.00

結果をファイルに出力

  res <-cor(tf)
  write.csv(res,"test.csv")

相関係数行列（単語間）

行と列を転置する

  t(tf)

##       a b  c d e  f g h
## test1 3 4 13 0 7  0 0 0
## test2 2 4  2 0 1 11 7 0
## test3 2 0  3 1 1  9 7 4

  round(cor(t(tf)),2)

##       a     b     c     d     e     f     g     h
## a  1.00  0.50  1.00 -0.50  1.00 -0.99 -1.00 -0.50
## b  0.50  1.00  0.43 -1.00  0.50 -0.34 -0.50 -1.00
## c  1.00  0.43  1.00 -0.43  1.00 -1.00 -1.00 -0.43
## d -0.50 -1.00 -0.43  1.00 -0.50  0.34  0.50  1.00
## e  1.00  0.50  1.00 -0.50  1.00 -0.99 -1.00 -0.50
## f -0.99 -0.34 -1.00  0.34 -0.99  1.00  0.99  0.34
## g -1.00 -0.50 -1.00  0.50 -1.00  0.99  1.00  0.50
## h -0.50 -1.00 -0.43  1.00 -0.50  0.34  0.50  1.00

  #res1

散布図

  plot(tf[,1],tf[,2], type="n",xlab=colnames(tf)[1],ylab=colnames(tf)[2])
  text(tf[,1],tf[,2],rownames(tf))
  cor(tf[,1],tf[,2])

## [1] -0.2876135

  mtext(paste("corr = " , round(cor(tf[,1],tf[,2]),2)), side=3)

実習2: manipulateでインタラクティブな散布図を作成

library(manipulate)

ヒント；picker関数を利用

出力イメージ

alt text

★proxyパッケージによる類似度計算

proxyのインストール

  install.packages("proxy")

proxyの読み込み

  library(proxy)

## 
## Attaching package: 'proxy'
## 
##  以下のオブジェクトは 'package:stats' からマスクされています: 
## 
##      as.dist, dist 
## 
##  以下のオブジェクトは 'package:base' からマスクされています: 
## 
##      as.matrix

変数間の相関行列

行と列を転置する

  simil(t(tf))

##            test1      test2
## test2 -0.2876135           
## test3 -0.3797959  0.7966974

変数間の相関行列（対角行列で結果表示）

  simil(t(tf), diag=T)

##            test1      test2      test3
## test1  0.0000000                      
## test2 -0.2876135  0.0000000           
## test3 -0.3797959  0.7966974  0.0000000

単語間の相関行列

  simil(tf)

##           a         b         c         d         e         f         g
## b 0.8396788                                                            
## c 0.7065527 0.5975136                                                  
## d 0.8254338 0.7391867 0.5319865                                        
## e 0.8300958 0.7951308 0.7417767 0.7902098                              
## f 0.3910904 0.3519814 0.1717172 0.3703704 0.2211862                    
## g 0.5863766 0.5472675 0.3670034 0.5656566 0.4164724 0.8047138          
## h 0.7883968 0.6280756 0.5690236 0.8888889 0.6790987 0.4814815 0.6767677

テキスト間のコサイン類似度

\[Cos(x,y)= \frac{\sum x_{i} y_{i}}{\sqrt{\sum x_{i}^2\sum y_{i}^2}} \]

  simil(t(tf), method="cosine")

##           test1     test2
## test2 0.2526633          
## test3 0.2628980 0.8973604

実習3: デレクトリ“univ”内のテキストを使用して、テキスト間の相関係数を計算する

結果出力例：相関係数（テキスト間）

##           hiroshima kufs kyoto osaka1 osaka2 osaka3 tokyo waseda
## hiroshima      1.00 0.63  0.71   0.67   0.67   0.65  0.60   0.71
## kufs           0.63 1.00  0.80   0.62   0.71   0.76  0.74   0.79
## kyoto          0.71 0.80  1.00   0.75   0.82   0.87  0.81   0.86
## osaka1         0.67 0.62  0.75   1.00   0.84   0.80  0.71   0.75
## osaka2         0.67 0.71  0.82   0.84   1.00   0.89  0.80   0.80
## osaka3         0.65 0.76  0.87   0.80   0.89   1.00  0.84   0.81
## tokyo          0.60 0.74  0.81   0.71   0.80   0.84  1.00   0.76
## waseda         0.71 0.79  0.86   0.75   0.80   0.81  0.76   1.00

実習4: デレクトリ“univ”内のテキストを使用して、テキスト間のコサイン類似度を計算する

結果出力例：コサイン類似度（テキスト間）

##        hiroshima kufs kyoto osaka1 osaka2 osaka3 tokyo
## kufs        0.65                                      
## kyoto       0.73 0.81                                 
## osaka1      0.68 0.65  0.77                           
## osaka2      0.69 0.73  0.83   0.84                    
## osaka3      0.66 0.77  0.87   0.81   0.90             
## tokyo       0.62 0.75  0.81   0.72   0.80   0.84      
## waseda      0.73 0.80  0.87   0.77   0.81   0.82  0.77

★クラスター分析1（デフォルト値：complete法 euclidean距離))

  tf <- getFreqDir("univ")
  hc <- hclust(dist(t(tf)))
  plot(hc)
  rect.hclust(hc, k=3, border="red")

★クラスター分析2(word法, キャンベラ距離)

  tf <- getFreqDir("univ")
  hc <- hclust(dist(t(tf), method = "canberra"), method = "ward")

## The "ward" method has been renamed to "ward.D"; note new "ward.D2"

  plot(hc)
  rect.hclust(hc, k=3, border="red")