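Previous assignment

Last week's bonus task

If you finish the assignment with time to spare, think about how to select a column of the variable tf so that other texts can also be displayed.

You do not need to write any code. Once you have an idea, explain it to me.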

Preparation

Loading the getFreqDir function

source("getFreqDir.R")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Creating the frequency table for the univ directory

univTable <- getFreqDir("univ")

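Selecting a text from a list in the Shiny app

Folder "app_freqBar2_practice1"

Excerpt from ui.R

  selectInput(inputId = "univName",
    label = "Choose a text:",
    choices = colnames(tf),
    selected = colnames(tf)[4]),

Excerpt from server.R

    # Name of the text currently chosen in the selectInput
    currentText <- input$univName
    # Sort the rows (words) by frequency in that text, highest first
    textFreq <- tf[order(tf[, colnames(tf) == currentText], decreasing = TRUE), ]

    # Frequencies and labels of the 50 most frequent words
    freq <- textFreq[, colnames(textFreq) == currentText][1:50]
    label <- rownames(textFreq)[1:50]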
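The sorting logic in the server.R excerpt can be tried outside Shiny. The following is a minimal sketch, assuming a small made-up matrix `tf_demo` and a fixed `currentText` standing in for `input$univName`; both are illustrative stand-ins and not part of the course files.

```r
# Toy word-by-text frequency matrix (hypothetical values, for illustration only)
tf_demo <- matrix(
  c(5, 1,
    2, 8,
    9, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("alpha", "beta", "gamma"), c("textA", "textB"))
)

# Stand-in for input$univName in the Shiny app
currentText <- "textB"

# Sort words by their frequency in the selected text, highest first
textFreq <- tf_demo[order(tf_demo[, currentText], decreasing = TRUE), ]

# Keep at most the top n words (the app itself uses 50)
n <- min(2, nrow(textFreq))
freq <- textFreq[1:n, currentText]
label <- rownames(textFreq)[1:n]

print(data.frame(label, freq))
```

Indexing the column by name (`tf_demo[, currentText]`) selects the same column as the logical comparison used in the app, provided the column names are unique.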

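Review exercise 1

Using the folder "app_freqBar2_practice1", modify server.R so that a "wordcloud" of the selected text is drawn.

If you are still stuck after thinking for more than 10 minutes, please ask.

Example run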

Frequency weighting (explanation video 1)

Term Frequency-Inverse Document Frequency

Words that occur across many texts have their frequencies weighted down.

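TF-IDF 1

\[ w = tf \times \log\left(\frac{N}{df}\right) \]

tf: term frequency (how often the word occurs in a text)
df: document frequency (the number of texts containing the word)
N: total number of texts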

Test data

(testTF<- getFreqDir("testData"))

##   test1 test2 test3 test4
## c    13     2     3     5
## e     7     1     1     2
## b     4     4     0     4
## a     3     2     2     4
## f     0    11     9    20
## g     0     7     7    14
## h     0     0     4     4
## d     0     0     1     1

Supplement: relative frequency matrix

colSums(testTF)

## test1 test2 test3 test4 
##    27    27    27    54

# Divide each column by its own total. A plain testTF/colSums(testTF) would
# recycle the totals down the rows and mix up the denominators.
round(sweep(testTF, 2, colSums(testTF), "/"), 4)

##    test1  test2  test3  test4
## c 0.4815 0.0741 0.1111 0.0926
## e 0.2593 0.0370 0.0370 0.0370
## b 0.1481 0.1481 0.0000 0.0741
## a 0.1111 0.0741 0.0741 0.0741
## f 0.0000 0.4074 0.3333 0.3704
## g 0.0000 0.2593 0.2593 0.2593
## h 0.0000 0.0000 0.1481 0.0741
## d 0.0000 0.0000 0.0370 0.0185
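In a relative frequency matrix, each column should sum to 1. The following is a minimal sketch that normalizes by column with `sweep()` and checks the column sums, assuming the test data are re-entered by hand as a matrix; the name `testTF_demo` is introduced here for illustration only.

```r
# Test data re-entered by hand (rows = words, columns = texts)
testTF_demo <- matrix(
  c(13,  2, 3,  5,
     7,  1, 1,  2,
     4,  4, 0,  4,
     3,  2, 2,  4,
     0, 11, 9, 20,
     0,  7, 7, 14,
     0,  0, 4,  4,
     0,  0, 1,  1),
  nrow = 8, byrow = TRUE,
  dimnames = list(c("c", "e", "b", "a", "f", "g", "h", "d"),
                  paste0("test", 1:4))
)

# Divide every column (margin 2) by that column's total
relTF <- sweep(testTF_demo, 2, colSums(testTF_demo), "/")

# For a matrix, prop.table() with margin = 2 gives the same result
relTF2 <- prop.table(testTF_demo, margin = 2)

print(round(relTF, 3))
print(colSums(relTF))  # every column should sum to 1
```

Dividing a matrix directly by a vector recycles the vector down the rows, so that shortcut is only safe when the vector has one entry per row.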

Supplement: sorting by row totals (most frequent words across all texts)

sort(rowSums(testTF),decreasing=TRUE)

##  f  g  c  b  e  a  h  d 
## 40 28 23 12 11 11  8  2

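Computing TF-IDF

  # N: number of texts (columns)
  N <- ncol(testTF)
  # df: number of texts in which each word (row) occurs at least once
  testDF <- apply(testTF, 1, function(x) length(x[x > 0]))
  # Weight each row by log(N/df); testDF has one entry per row, so recycling lines up
  testWeighted <- testTF * log(N / testDF)
  round(testWeighted, 2)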

##   test1 test2 test3 test4
## c  0.00  0.00  0.00  0.00
## e  0.00  0.00  0.00  0.00
## b  1.15  1.15  0.00  1.15
## a  0.00  0.00  0.00  0.00
## f  0.00  3.16  2.59  5.75
## g  0.00  2.01  2.01  4.03
## h  0.00  0.00  2.77  2.77
## d  0.00  0.00  0.69  0.69

Removing rows whose values are all 0

  (testTFIDF <- testWeighted[rowSums(testWeighted)>0,])

##      test1    test2     test3     test4
## b 1.150728 1.150728 0.0000000 1.1507283
## f 0.000000 3.164503 2.5891387 5.7536414
## g 0.000000 2.013775 2.0137745 4.0275490
## h 0.000000 0.000000 2.7725887 2.7725887
## d 0.000000 0.000000 0.6931472 0.6931472
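The weighting and the removal of all-zero rows can be wrapped in a single function. The following is a minimal sketch, assuming a hypothetical helper name `tfidf_weight`, non-negative counts, and test data re-entered by hand as `testTF_demo`. It uses the natural logarithm, as `log()` does in the code above.

```r
# Hypothetical helper: raw frequency matrix in, TF-IDF matrix out
tfidf_weight <- function(tf_mat) {
  N <- ncol(tf_mat)          # number of texts
  df <- rowSums(tf_mat > 0)  # number of texts containing each word
  # Drop words that never occur, to avoid dividing by zero
  tf_mat <- tf_mat[df > 0, , drop = FALSE]
  df <- df[df > 0]
  # df has one entry per row, so the weights are applied row by row
  weighted <- tf_mat * log(N / df)
  # Words found in every text get weight 0 everywhere and are removed
  weighted[rowSums(weighted) > 0, , drop = FALSE]
}

# Test data re-entered by hand (rows = words, columns = texts)
testTF_demo <- matrix(
  c(13,  2, 3,  5,
     7,  1, 1,  2,
     4,  4, 0,  4,
     3,  2, 2,  4,
     0, 11, 9, 20,
     0,  7, 7, 14,
     0,  0, 4,  4,
     0,  0, 1,  1),
  nrow = 8, byrow = TRUE,
  dimnames = list(c("c", "e", "b", "a", "f", "g", "h", "d"),
                  paste0("test", 1:4))
)

res <- tfidf_weight(testTF_demo)
print(round(res, 2))
```

The words c, e and a occur in all four texts, so log(N/df) = log(1) = 0 and they drop out of the result.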

Comparing cosine similarity between texts

\[ \cos(x,y) = \frac{\sum_{i} x_{i} y_{i}}{\sqrt{\sum_{i} x_{i}^2 \sum_{i} y_{i}^2}} \]
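The formula can be computed directly in base R. The following is a minimal sketch, assuming a hypothetical helper `cosine_sim` and the raw frequencies of test1 and test2 entered by hand from the test data.

```r
# Hypothetical helper implementing the cosine formula for two numeric vectors
cosine_sim <- function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

# Raw frequencies of the words c, e, b, a, f, g, h, d in two test texts
test1 <- c(13, 7, 4, 3, 0, 0, 0, 0)
test2 <- c(2, 1, 4, 2, 11, 7, 0, 0)

print(round(cosine_sim(test1, test2), 2))  # 0.25
```

The two texts share only the words c, e, b and a, so the numerator is 13*2 + 7*1 + 4*4 + 3*2 = 55, and the similarity is low.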

Loading the proxy package

library(proxy)

## 
## Attaching package: 'proxy'

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

Computing cosine similarity from raw frequencies

round(simil(t(testTF), method="cosine"),2)

##       test1 test2 test3
## test2  0.25            
## test3  0.26  0.90      
## test4  0.26  0.98  0.97

Computing cosine similarity from the TF-IDF data

round(simil(t(testTFIDF), method="cosine"),2)

##       test1 test2 test3
## test2  0.29            
## test3  0.00  0.72      
## test4  0.15  0.92  0.93

Practice with the "univ" directory

Creating the raw frequency table

tf <- getFreqDir("univ")

Checking the raw frequency matrix with the View function

View(tf)

Extracting the 30 most frequent words across all texts

top30<-sort(rowSums(tf),decreasing=TRUE)[1:30]
names(top30)

##  [1] "the"        "of"         "and"        "to"         "in"        
##  [6] "university" "a"          "is"         "that"       "as"        
## [11] "osaka"      "research"   "with"       "for"        "its"       
## [16] "education"  "on"         "it"         "i"          "students"  
## [21] "s"          "we"         "by"         "be"         "will"      
## [26] "this"       "society"    "knowledge"  "are"        "an"

Hierarchical cluster analysis

All data

hc <- hclust(dist(t(tf), method = "cosine"), method = "ward.D2")
plot(hc)

Top 30 most frequent words

TFtop30<-tf[rownames(tf) %in% names(top30),]
hc <- hclust(dist(t(TFtop30), method = "cosine"), method = "ward.D2")
plot(hc)
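Once a dendrogram has been built, `cutree()` turns it into group labels. The following is a minimal sketch on the small test data. It assumes cosine distance is taken as 1 minus cosine similarity and computes it in base R rather than with proxy; the names `testTF_demo`, `cos_sim` and `hc_demo` are introduced here for illustration.

```r
# Test data re-entered by hand (rows = words, columns = texts)
testTF_demo <- matrix(
  c(13,  2, 3,  5,
     7,  1, 1,  2,
     4,  4, 0,  4,
     3,  2, 2,  4,
     0, 11, 9, 20,
     0,  7, 7, 14,
     0,  0, 4,  4,
     0,  0, 1,  1),
  nrow = 8, byrow = TRUE,
  dimnames = list(c("c", "e", "b", "a", "f", "g", "h", "d"),
                  paste0("test", 1:4))
)

# Cosine similarity between every pair of texts (columns)
norms <- sqrt(colSums(testTF_demo^2))
cos_sim <- crossprod(testTF_demo) / outer(norms, norms)

# Assumed conversion: distance = 1 - similarity
cos_dist <- as.dist(1 - cos_sim)

# Ward clustering of the texts, then cut the tree into two groups
hc_demo <- hclust(cos_dist, method = "ward.D2")
groups <- cutree(hc_demo, k = 2)
print(groups)
```

Here test1 ends up in a group of its own, because its cosine similarity to the other three texts is much lower than their similarity to each other.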

Non-hierarchical cluster analysis: kmeans (explanation video 2)

library(cluster)

Fixing the random seed for reproducibility

set.seed(1209)

km.freq<-kmeans(TFtop30, centers=5, iter.max=100)
km.freq$cluster

##         to        and        the university         of          a        for 
##          4          5          5          4          5          4          2 
##         in         be         is         as          i   research          s 
##          4          2          3          3          2          2          2 
##  education         it  knowledge         on       with       that   students 
##          2          2          2          2          3          3          2 
##        are       will         an       this         we        its         by 
##          2          2          2          2          2          2          2 
##    society      osaka 
##          2          1
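kmeans starts from randomly chosen centers, so the cluster numbering (and sometimes the grouping itself) can change between runs unless the seed is fixed. The following is a minimal sketch on the small test data; `testTF_demo`, `km1` and `km2` are illustrative names, and two clusters is an arbitrary choice.

```r
# Test data re-entered by hand (rows = words, columns = texts)
testTF_demo <- matrix(
  c(13,  2, 3,  5,
     7,  1, 1,  2,
     4,  4, 0,  4,
     3,  2, 2,  4,
     0, 11, 9, 20,
     0,  7, 7, 14,
     0,  0, 4,  4,
     0,  0, 1,  1),
  nrow = 8, byrow = TRUE,
  dimnames = list(c("c", "e", "b", "a", "f", "g", "h", "d"),
                  paste0("test", 1:4))
)

# Same seed before each call, so both runs pick the same starting centers
set.seed(1209)
km1 <- kmeans(testTF_demo, centers = 2, iter.max = 100)
set.seed(1209)
km2 <- kmeans(testTF_demo, centers = 2, iter.max = 100)

print(identical(km1$cluster, km2$cluster))  # TRUE
print(km1$cluster)
```

As in the lecture code, the rows (words) are clustered here, not the texts. The `nstart` argument of `kmeans()` can also be raised to try several random starts and keep the best one.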

Plotting the data with the clusplot function; 2D layout based on principal components (princomp)

clusplot(TFtop30, km.freq$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

Enlarged view

clusplot(TFtop30, km.freq$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, xlim=c(-1.8,-0.7),ylim=c(-0.7,1.0))

Cluster center matrix (the center vector of each cluster)

round(km.freq$centers,4)

##   hiroshima    kufs   kyoto  osaka1  osaka2  osaka3   tokyo  waseda
## 1    0.0000  0.0000  0.0000  9.0000 18.0000 14.0000  0.0000  0.0000
## 2    0.9444  2.1111  5.2222  2.0000  2.3889  3.2222  3.5556  5.2778
## 3    1.5000  9.2500  6.0000  3.0000  5.0000  6.7500  7.5000  9.5000
## 4    6.2500 13.5000 22.2500 10.2500 15.7500 12.7500  9.5000 28.2500
## 5    6.3333 20.3333 36.3333 13.6667 27.6667 31.0000 36.6667 36.6667
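Each row of the center matrix is the column-wise mean of the words assigned to that cluster. The following is a minimal sketch that checks this on the small test data, assuming two clusters and hand-entered data; `testTF_demo`, `km_demo` and `manual` are illustrative names.

```r
# Test data re-entered by hand (rows = words, columns = texts)
testTF_demo <- matrix(
  c(13,  2, 3,  5,
     7,  1, 1,  2,
     4,  4, 0,  4,
     3,  2, 2,  4,
     0, 11, 9, 20,
     0,  7, 7, 14,
     0,  0, 4,  4,
     0,  0, 1,  1),
  nrow = 8, byrow = TRUE,
  dimnames = list(c("c", "e", "b", "a", "f", "g", "h", "d"),
                  paste0("test", 1:4))
)

set.seed(1209)
km_demo <- kmeans(testTF_demo, centers = 2, iter.max = 100)

# Recompute the centers by hand: mean of each column within each cluster
manual <- t(sapply(1:2, function(k) {
  colMeans(testTF_demo[km_demo$cluster == k, , drop = FALSE])
}))

print(round(km_demo$centers, 4))
print(round(manual, 4))
print(all.equal(unname(manual), unname(km_demo$centers)))
```

Read this way, a row of the center matrix shows the average frequency of that cluster's words in each text.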

osaka1

boxplot(TFtop30[,4]~km.freq$cluster,data=TFtop30,col="lightblue")

waseda

boxplot(TFtop30[,8]~km.freq$cluster,data=TFtop30,col="lightblue")

Corpus Linguistics B: Lecture09 (Fall 2020)

Lecture9: TF-IDF, Kmeans

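Today's assignment (due December 16)

Add an argument to "getFreqDir.R" so that the function returns either the raw frequency matrix or the TF-IDF matrix.

When you have finished, please email me the name of the file you ran on RStudio Cloud.

Example run

Reloading the getFreqDir function

With no argument specified -> raw frequency matrix

With the tfidf argument set to "1" -> TF-IDF matrix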