Lecture7: Shiny(その2) & 類似度計算

★準備

library

library(wordcloud)

## Loading required package: RColorBrewer

library(shiny)

getFreqDir.Rを読み込む

source("getFreqDir.R")

univディレクトリから頻度表を作成

tf <- getFreqDir("univ")
head(tf)

##     hiroshima kufs kyoto osaka1 osaka2 osaka3 tokyo waseda
## 000         0    0     0      0      0      0     0      2
## 1           0    0     0      1      0      0     0      1
## 10          0    0     1      0      0      0     0      1
## 100         0    0     0      0      1      0     0      0
## 11          2    0     0      0      0      0     0      0
## 12          1    0     0      0      0      0     0      0

　練習:tfの“waseda”列の情報でwordcloudを作る

実行結果

　複数文書比較：comparison.cloud()の実行例

hiroshima, kufs

comparison.cloud(tf[,1:2])

osaka1,osaka3

comparison.cloud(tf[,c(4,6)],max.words=40)

　comparison.cloud()の実行例

colnames(tf)
comparison.cloud(tf[,c(F,F,T,F,F,T,T,F)],max.words=40)

RColorBrewer

head(brewer.pal.info[])

##      maxcolors category colorblind
## BrBG        11      div       TRUE
## PiYG        11      div       TRUE
## PRGn        11      div       TRUE
## PuOr        11      div       TRUE
## RdBu        11      div       TRUE
## RdGy        11      div      FALSE

rownames(brewer.pal.info[])

##  [1] "BrBG"     "PiYG"     "PRGn"     "PuOr"     "RdBu"     "RdGy"    
##  [7] "RdYlBu"   "RdYlGn"   "Spectral" "Accent"   "Dark2"    "Paired"  
## [13] "Pastel1"  "Pastel2"  "Set1"     "Set2"     "Set3"     "Blues"   
## [19] "BuGn"     "BuPu"     "GnBu"     "Greens"   "Greys"    "Oranges" 
## [25] "OrRd"     "PuBu"     "PuBuGn"   "PuRd"     "Purples"  "RdPu"    
## [31] "Reds"     "YlGn"     "YlGnBu"   "YlOrBr"   "YlOrRd"

RColorBrewer

binfo<-brewer.pal.info[]
palets <-rownames(binfo[binfo$maxcolors>10,])

tmpPalet = palets[1]
(maxNum=binfo[rownames(binfo)==tmpPalet,][1])

##      maxcolors
## BrBG        11

wordcloud(rownames(tf),tf$waseda,
          colors=brewer.pal(as.integer(maxNum),tmpPalet))

tmpPalet = palets[5]
wordcloud(rownames(tf),tf$waseda,
          colors=brewer.pal(as.integer(maxNum),tmpPalet),min.freq=5)

Shiny実習（後ほど）

実習1: app_zipfを改良して、定数A, Kの値をスライダーでインタラクティブに動かす

定数Aの初期値: 0.7

定数Kの初期値: 最大頻度数(K=freqMtx[1,1])

alt text

実習2:

RColorBrewerのパレットで色を選んで、WordCloudを描画

min.freqをスライダーで可変

alt text

★相関係数

ピアソン積率相関係数

\[Corr(x,y)= \frac{\sum (x_{i}-\overline{x}) (y_{i}-\overline{y})}{\sqrt{\sum (x_{i}-\overline{x})^2\sum (y_{i}-\overline{y})^2}} \]

相関係数行列（テキスト間）

  tf <- getFreqDir("testData")
  res <-cor(tf)
  round(res,2)

##       test1 test2 test3 test4
## test1  1.00 -0.29 -0.38 -0.35
## test2 -0.29  1.00  0.80  0.96
## test3 -0.38  0.80  1.00  0.94
## test4 -0.35  0.96  0.94  1.00

相関係数行列（単語間）

行と列を転置する

  t(tf)

##       a b  c d e  f  g h
## test1 3 4 13 0 7  0  0 0
## test2 2 4  2 0 1 11  7 0
## test3 2 0  3 1 1  9  7 4
## test4 4 4  5 1 2 20 14 4

  round(cor(t(tf)),2)

##      a     b     c     d     e     f     g     h
## a 1.00  0.52  0.40  0.30  0.33  0.42  0.43  0.30
## b 0.52  1.00  0.37 -0.58  0.41  0.08  0.00 -0.58
## c 0.40  0.37  1.00 -0.40  0.99 -0.66 -0.65 -0.40
## d 0.30 -0.58 -0.40  1.00 -0.50  0.63  0.71  1.00
## e 0.33  0.41  0.99 -0.50  1.00 -0.71 -0.71 -0.50
## f 0.42  0.08 -0.66  0.63 -0.71  1.00  1.00  0.63
## g 0.43  0.00 -0.65  0.71 -0.71  1.00  1.00  0.71
## h 0.30 -0.58 -0.40  1.00 -0.50  0.63  0.71  1.00

散布図

  plot(tf[,1],tf[,2], type="n",xlab=colnames(tf)[1],ylab=colnames(tf)[2])
  text(tf[,1],tf[,2],rownames(tf))
  cor(tf[,1],tf[,2])

## [1] -0.2876135

  mtext(paste("corr = " , round(cor(tf[,1],tf[,2]),2)), side=3)

デレクトリ“univ”内のテキストを使用して、テキスト間の相関係数を計算する

結果出力例：相関係数（テキスト間）

tf <- getFreqDir("univ")
res <-cor(tf)

proxyパッケージによる類似度計算

proxyパッケージのインストール

install.packages("proxy")

proxyの読み込み

library(proxy)

## 
## Attaching package: 'proxy'

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

変数間の相関行列

行と列を転置する

corr <- simil(t(tf))
round(corr, 2)

##        hiroshima kufs kyoto osaka1 osaka2 osaka3 tokyo
## kufs        0.63                                      
## kyoto       0.71 0.80                                 
## osaka1      0.67 0.62  0.75                           
## osaka2      0.67 0.71  0.82   0.84                    
## osaka3      0.65 0.76  0.87   0.80   0.89             
## tokyo       0.60 0.74  0.81   0.71   0.80   0.84      
## waseda      0.71 0.79  0.86   0.75   0.80   0.81  0.76

変数間の相関行列（対角行列で結果表示）

corr <- simil(t(tf), diag=T)
round(corr, 2)

##           hiroshima kufs kyoto osaka1 osaka2 osaka3 tokyo waseda
## hiroshima        NA                                             
## kufs           0.63   NA                                        
## kyoto          0.71 0.80    NA                                  
## osaka1         0.67 0.62  0.75     NA                           
## osaka2         0.67 0.71  0.82   0.84     NA                    
## osaka3         0.65 0.76  0.87   0.80   0.89     NA             
## tokyo          0.60 0.74  0.81   0.71   0.80   0.84    NA       
## waseda         0.71 0.79  0.86   0.75   0.80   0.81  0.76     NA

★テキスト間のコサイン類似度

\[Cos(x,y)= \frac{\sum x_{i} y_{i}}{\sqrt{\sum x_{i}^2\sum y_{i}^2}} \]

simil(t(tf), method="cosine")

##        hiroshima      kufs     kyoto    osaka1    osaka2    osaka3
## kufs   0.6526589                                                  
## kyoto  0.7279603 0.8104367                                        
## osaka1 0.6841531 0.6458906 0.7676934                              
## osaka2 0.6885497 0.7264594 0.8327895 0.8434805                    
## osaka3 0.6645501 0.7703932 0.8757348 0.8060690 0.8955915          
## tokyo  0.6187031 0.7488038 0.8125733 0.7176287 0.8041446 0.8419948
## waseda 0.7278519 0.8020611 0.8653391 0.7658753 0.8107266 0.8200624
##            tokyo
## kufs            
## kyoto           
## osaka1          
## osaka2          
## osaka3          
## tokyo           
## waseda 0.7712384

デレクトリ“univ”内のテキストを使用して、テキスト間のコサイン類似度を計算する

結果出力例：コサイン類似度（テキスト間）

tf <- getFreqDir("univ")
res <-simil(t(tf), method="cosine")

round(res,2)

##        hiroshima kufs kyoto osaka1 osaka2 osaka3 tokyo
## kufs        0.65                                      
## kyoto       0.73 0.81                                 
## osaka1      0.68 0.65  0.77                           
## osaka2      0.69 0.73  0.83   0.84                    
## osaka3      0.66 0.77  0.88   0.81   0.90             
## tokyo       0.62 0.75  0.81   0.72   0.80   0.84      
## waseda      0.73 0.80  0.87   0.77   0.81   0.82  0.77

★クラスター分析1（デフォルト値：complete法 euclidean距離))

tf <- getFreqDir("univ")
hc <- hclust(dist(t(tf)))
plot(hc)
rect.hclust(hc, k=3, border="red")

★クラスター分析2(ward法, キャンベラ距離)

tf <- getFreqDir("univ")
hc <- hclust(dist(t(tf), method = "canberra"), method = "ward.D")
plot(hc)
rect.hclust(hc, k=3, border="red")

★課題（締切日12月11日）

階層的クラスター分析のアプリケーション作成

“univ”ディレクトリのテキストデータを使用して、階層的クラスター分析のデンドログラムを表示するアプリケーションをShinyで作成してください。

ui画面には以下のinput機能を搭載
1. 距離行列計算法の選択："euclidean", "canberra", "manhattan"
2. クラスター間距離計算法の選択："average", "complete", "ward"
3. 樹形図内長方形の数：2-6