Lecture 6: Text Processing in R (Review)

Review of the previous lecture

Building a word frequency table

Reading a text file

Read the file one line at a time and store the lines in a character vector (one element per line)

txt <- readLines("tufs.txt")

Contents of the first line

txt[1]

## [1] "These days, raising “global human resources” is being preached from various areas of industry, government, and academia. After entering the 21st century, the borderline between domestic and overseas markets has disappeared, people and objects move dramatically across national borders, and globalization is further progressing. With the economics, societies and cultures of various regions of the world being swallowed into globalization, in order to deal effectively with various situations, it is necessary to view these situations from above like a bird looking down from the sky. Global human resources have the ability to accurately overlook these situations in their entirety with a wide, global sense of view, and they are needed for this reason."

Number of lines read

length(txt)

## [1] 5
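The reading step can be tried without the lecture's data file. The following is a minimal sketch that assumes nothing about tufs.txt: it writes two made-up lines to a temporary file and reads them back. The names `tmp_file` and `demo_txt` are illustrative only.

```r
# Hypothetical stand-in for "tufs.txt": write two made-up lines to a temp file
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("First line of text.", "Second line of text."), tmp_file)

# readLines() returns a character vector with one element per line
demo_txt <- readLines(tmp_file)

is.character(demo_txt)  # a character vector, not a list
length(demo_txt)        # number of lines read
demo_txt[2]             # index with [ ] to get a single line

unlink(tmp_file)        # remove the temporary file
```

If a file contains non-ASCII characters such as curly quotes, the `encoding` argument of `readLines()` may be needed, depending on the platform.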

Splitting on whitespace and punctuation

wordL <- strsplit(txt, "[[:space:]]|[[:punct:]]")

Combining the per-line results into a single vector

wordL <- unlist(wordL)

Converting to lowercase

wordL <- tolower(wordL)

Removing empty strings (""); the two lines below are equivalent filters, so either one is enough

wordL <- wordL[nchar(wordL) > 0]
wordL <- wordL[wordL != ""]
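The steps so far (split, flatten, lowercase, drop empty strings) can be checked on a small in-memory example. This is only a sketch: the two sample sentences and the names `sample_txt` and `sample_words` are invented for illustration and are not part of the lecture data.

```r
# Hypothetical two-line input standing in for the lines returned by readLines()
sample_txt <- c("The cat sat, and the dog sat.", "A cat is a cat!")

# Split each line at every single whitespace or punctuation character
sample_words <- strsplit(sample_txt, "[[:space:]]|[[:punct:]]")

# Flatten the list (one element per line) into one character vector
sample_words <- unlist(sample_words)

# Normalise case so that "The" and "the" count as the same word
sample_words <- tolower(sample_words)

# Adjacent delimiters such as ", " leave empty strings behind; drop them
sample_words <- sample_words[nchar(sample_words) > 0]

sample_words
length(sample_words)
```

The empty strings appear because the pattern matches one delimiter character at a time, so a comma followed by a space produces an empty piece between the two matches.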

Number of word tokens

tokens <- length(wordL)
tokens

## [1] 548

Number of word types

The unique() function returns the elements of a vector with duplicates removed

types <- length(unique(wordL))
types

## [1] 244

Computing the TTR (Type-Token Ratio)

\[ TTR=\frac{types}{tokens} \times 100 \]

types/tokens * 100

## [1] 44.53
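The same calculation can be wrapped in a small helper so that it can be reused on any word vector. `ttr()` is a hypothetical name chosen here, not a base R function, and the sample vector is made up.

```r
# Hypothetical helper: Type-Token Ratio (as a percentage) of a character vector
ttr <- function(words) {
  n_tokens <- length(words)          # every running word counts
  n_types  <- length(unique(words))  # each distinct word counts once
  n_types / n_tokens * 100
}

# Made-up example: 6 tokens, 4 types (a, b, c, d)
ttr(c("a", "b", "a", "c", "a", "d"))
```

TTR tends to fall as a text gets longer, because common words keep repeating, so it is best compared across texts of similar length.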

Word frequencies

freqL <- sort(table(wordL), decreasing = TRUE)

Word frequencies (top 5 words)

freqL[1:5]

## wordL
##      of     and     the studies      in 
##      36      34      34      15      14
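To see how `table()` and `sort()` cooperate, here is a small sketch on an invented word vector. The names `toy_words` and `toy_freq` are illustrative and do not appear in the lecture.

```r
# Made-up word vector standing in for wordL
toy_words <- c("the", "cat", "the", "dog", "the", "cat")

# table() counts each distinct word; sort() puts the most frequent first
toy_freq <- sort(table(toy_words), decreasing = TRUE)
toy_freq

# Positional indexing picks the top n entries, as freqL[1:5] does above
toy_freq[1:2]

# The words themselves are stored as the names of the table
names(toy_freq)
```

Because the words are stored as names, a single count can also be looked up by name, for example `toy_freq["dog"]`.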

Writing the results to a file

write.csv(freqL, "freq-tufs.csv")
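One possible variation, sketched here under the assumption that a predictable two-column layout is wanted, is to build the data frame explicitly before writing. The column names `word` and `freq`, the temporary file path, and the toy data are all choices made for this example.

```r
# Made-up frequency table standing in for freqL
toy_freq <- sort(table(c("the", "cat", "the", "sat", "the", "cat")),
                 decreasing = TRUE)

# Build an explicit two-column data frame so the CSV header is predictable
freq_df <- data.frame(word = names(toy_freq),
                      freq = as.vector(toy_freq),
                      stringsAsFactors = FALSE)

# Write to a temporary file (hypothetical path) without the row-number column
out_file <- tempfile(fileext = ".csv")
write.csv(freq_df, out_file, row.names = FALSE)

# Read the file back to confirm what was written
freq_back <- read.csv(out_file, stringsAsFactors = FALSE)
freq_back

unlink(out_file)  # remove the temporary file
```

Reading the file back is a cheap way to confirm that the counts survived the round trip.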

Yule's K characteristic

\[ K=10000 \times \frac{(\sum m^2 \times freq(m)) -tokens}{tokens^2} \]

Building the frequency spectrum (for each frequency m, the number of word types that occur exactly m times)

mFreq <- table(freqL)
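A frequency spectrum is easier to read on a tiny example. The sketch below uses an invented seven-word vector; `toy_spectrum`, `m`, and `freq_m` are names introduced here for illustration.

```r
# Made-up word vector: the x3, cat x2, dog x1, sat x1
toy_words <- c("the", "cat", "the", "dog", "the", "cat", "sat")

# First table: how often each word occurs
toy_freq <- table(toy_words)

# Second table: how many words share each frequency value
toy_spectrum <- table(toy_freq)
toy_spectrum

# The names are the frequency values m, the counts are freq(m)
m      <- as.numeric(names(toy_spectrum))
freq_m <- as.vector(toy_spectrum)

# Sanity check: summing m * freq(m) recovers the number of tokens
sum(m * freq_m) == length(toy_words)
```

The final check holds for any text, since every token belongs to exactly one word type, and it is a quick way to catch mistakes in the spectrum.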

Number of word tokens

tokens

## [1] 548

A frequency value (m), the number of word types with that frequency, and their product

names(mFreq[3])

## [1] "3"

mFreq[3]

## 3 
## 8

as.numeric(names(mFreq[3])) * mFreq[3]

##  3 
## 24

Yule's K: partial computation 1

\[ m^2 \times freq(m) \]

as.numeric(names(mFreq[3]))^2 * mFreq[3]

##  3 
## 72

m2 <- mapply(function(x, y) as.numeric(x)^2 * y, names(mFreq), mFreq)

Yule's K: partial computation 2

\[ \sum( m^2 \times freq(m)) \]

sum(m2)

## [1] 5504
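The element-by-element `mapply()` call can also be written with vectorised arithmetic. The sketch below compares the two on an invented spectrum; it assumes, as in the lecture, that the names of the spectrum table are the frequency values. All variable names are illustrative.

```r
# Made-up spectrum: 2 words occur once, 1 word twice, 1 word three times
toy_words    <- c("the", "cat", "the", "dog", "the", "cat", "sat")
toy_spectrum <- table(table(toy_words))

# Element-by-element version, in the style of the mapply() call above
m2_mapply <- mapply(function(x, y) as.numeric(x)^2 * y,
                    names(toy_spectrum), toy_spectrum)

# Vectorised version: square all m at once, then multiply by freq(m)
m      <- as.numeric(names(toy_spectrum))
m2_vec <- m^2 * as.vector(toy_spectrum)

# Both approaches give the same terms and therefore the same sum
all.equal(unname(m2_mapply), m2_vec)
sum(m2_vec)
```

The vectorised form avoids the anonymous function and is usually easier to read, while `mapply()` makes the pairing of each m with its freq(m) explicit.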

Yule's K characteristic

K <- 10000 * (sum(m2) - tokens)/tokens^2
K

## [1] 165
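The whole calculation can be collected into one function. This is a sketch under the formula given above, not a library routine: `yule_k()` is a hypothetical name, and the test vectors are made up.

```r
# Hypothetical helper: Yule's K for a character vector of word tokens
yule_k <- function(words) {
  n_tokens <- length(words)
  spectrum <- table(table(words))           # frequency spectrum
  m        <- as.numeric(names(spectrum))   # frequency values
  freq_m   <- as.vector(spectrum)           # number of types per frequency
  10000 * (sum(m^2 * freq_m) - n_tokens) / n_tokens^2
}

# Made-up example: 7 tokens, sum of m^2 * freq(m) is 15
yule_k(c("the", "cat", "the", "dog", "the", "cat", "sat"))

# A text with no repeated words gives K = 0
yule_k(c("a", "b", "c"))
```

A larger K indicates more repetition, that is, lower lexical diversity. Plugging the lecture's own figures into the formula, 10000 * (5504 - 548) / 548^2, gives about 165.03, matching the value printed above.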