Lecture 3: Data Shaping (Part 2)

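Review of the previous lecture: building a word frequency table

Reading a text file

Read the file line by line and store the lines in a character vector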

txt <- readLines("Data/osaka-u.txt", encoding = "UTF-8")

Contents of the first line

txt[1]

## [1] "\"Osaka Imperial University\" was founded in 1931 thanks to the enthusiastic support of the citizens of Osaka and persons involved in the university with two schools, Medicine and Science. However, our university's roots reach back to Tekijuku, a private \"place of learning\" founded by the doctor and scholar of Western sciences OGATA Koan at the end of the Edo period. Tekijuku's open academic culture and forward-looking spirit eventually gave birth to Osaka Prefecture Medical School and, eventually, to today's Osaka University."

Number of lines read

length(txt)

## [1] 21

Splitting on whitespace and punctuation

wordLst <- strsplit(txt, "[[:space:]]|[[:punct:]]")

Flattening the per-line results into a single vector

wordLst <- unlist(wordLst)

Converting to lowercase

wordLst <- tolower(wordLst)

Removing empty strings "" (use whichever of the two forms you prefer)

wordLst <- wordLst[nchar(wordLst) > 0]
wordLst <- wordLst[wordLst != ""]
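
The need for the empty-string filter is easiest to see on a small input. The following is a minimal sketch using a made-up sentence (the variable names `sample`, `parts`, and `words` are illustrative and not part of the course data). Because the pattern matches one delimiter character at a time, a punctuation mark followed by a space leaves an empty string between the two matches.

```r
# Hypothetical one-line input (not taken from osaka-u.txt)
sample <- "Hello, world. Hello again!"

# Split on every single whitespace or punctuation character
parts <- unlist(strsplit(sample, "[[:space:]]|[[:punct:]]"))
parts
# "Hello" "" "world" "" "Hello" "again"

# Lowercase the tokens and drop the empty strings
words <- tolower(parts)
words <- words[words != ""]
words
# "hello" "world" "hello" "again"
```

Both filters shown above (`nchar(wordLst) > 0` and `wordLst != ""`) remove the same elements here, so applying one of them is enough.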

Number of word tokens

tokens <- length(wordLst)
tokens

## [1] 689

Number of word types

unique() returns the distinct (non-duplicated) elements of a vector

types <- length(unique(wordLst))
types

## [1] 305

Computing the TTR (type-token ratio)

\[ TTR=\frac{types}{tokens} \times 100 \]

types/tokens * 100

## [1] 44.27
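
As a small worked example of the same formula, the sketch below computes the TTR for a made-up vector of six tokens (the names `toy`, `toyTokens`, `toyTypes`, and `toyTTR` are illustrative assumptions, not part of the lecture code).

```r
# Hypothetical token vector (not from the course data)
toy <- c("a", "b", "a", "c", "a", "b")

toyTokens <- length(toy)           # every occurrence counts: 6
toyTypes <- length(unique(toy))    # distinct words only: 3 (a, b, c)

# Same formula as above: types / tokens * 100
toyTTR <- toyTypes / toyTokens * 100
toyTTR
# 50
```

A text in which every word is different has a TTR of 100, and the value falls as words are repeated.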

Word frequencies

freqLst <- sort(table(wordLst), decreasing = TRUE)

Word frequencies (top 5 words)

freqLst[1:5]

## wordLst
##        the        and         of university      osaka 
##         41         33         30         28         23

Word frequency distribution

subfreq <- freqLst[1:10]
title <- "Word Frequency Distribution"
xlabel <- "Word"
ylabel <- "Frequency"
barplot(subfreq, main = title, xlab = xlabel, ylab = ylabel, las = 3)

Word frequency distribution (in color)

colors <- c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan")
barplot(subfreq, col = colors, main = title, xlab = xlabel, ylab = ylabel, las = 3)

Available color names

colors()[1:10]

##  [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
##  [5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"   
##  [9] "aquamarine1"   "aquamarine2"

manipulate package

Interactive plots

library(manipulate)

Choosing a color

The picker() function

manipulate(barplot(subfreq, col = myColors, main = title, xlab = xlabel, ylab = ylabel, 
    las = 3), myColors = picker("red", "yellow", "green", "violet", "orange", 
    "blue", "pink", "cyan"))

Adding a slider

manipulate(barplot(freqLst, col = myColors, main = title, xlab = xlabel, ylab = ylabel, 
    xlim = c(0, x.max), las = 3), myColors = picker("red", "yellow", "green", 
    "violet", "orange", "blue", "pink", "cyan"), x.max = slider(5, 300, initial = 10))

Relative frequency

Reading the file "test1.txt"

Read the file line by line and store the lines in a character vector

txt <- readLines("Data/test1.txt")

Splitting on whitespace and punctuation

wordLst <- strsplit(txt, "[[:space:]]|[[:punct:]]")

Flattening the per-line results into a single vector

wordLst <- unlist(wordLst)

Counting frequencies

freq <- sort(table(wordLst), decreasing = TRUE)
freq

## wordLst
##  c  e  b  a 
## 13  7  4  3

Relative frequency

The proportion of occurrences of each word when the total is taken as 1

relative <- freq/sum(freq)

Sum of the relative frequencies

sum(relative)

## [1] 1

Rounding to two decimal places

round(relative, 2)

## wordLst
##    c    e    b    a 
## 0.48 0.26 0.15 0.11
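
Base R also provides prop.table(), which divides each cell of a table by the total. The sketch below compares it with the manual calculation. The token vector is rebuilt from the counts shown above rather than read from test1.txt, and the names `toyWords`, `toyFreq`, `manual`, and `viaPropTable` are illustrative assumptions.

```r
# Rebuild a token vector with the same counts as above (c: 13, e: 7, b: 4, a: 3)
toyWords <- rep(c("c", "e", "b", "a"), times = c(13, 7, 4, 3))
toyFreq <- sort(table(toyWords), decreasing = TRUE)

# Manual calculation, as in the lecture
manual <- toyFreq / sum(toyFreq)

# Built-in helper: each cell divided by the grand total
viaPropTable <- prop.table(toyFreq)

# The two results agree
all.equal(as.vector(manual), as.vector(viaPropTable))

round(as.vector(viaPropTable), 2)
# 0.48 0.26 0.15 0.11
```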

Converting to data frames

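# names() and as.vector() give exactly two columns (word, freq),
# even when sort(table()) returns an object of class "table"
freqData <- data.frame(word = names(freq), freq = as.vector(freq), row.names = names(freq))
relativeData <- data.frame(word = names(relative), freq = as.vector(relative), row.names = names(relative))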

##   word freq
## c    c   13
## e    e    7
## b    b    4
## a    a    3

##   word   freq
## c    c 0.4815
## e    e 0.2593
## b    b 0.1481
## a    a 0.1111


Merging the two data frames (merge)

freqMtx <- merge(freqData, relativeData, all = TRUE, by = "word")

Naming the columns

names(freqMtx) <- c("term", "raw", "relative")

Sorting by raw frequency

freqOrder <- order(freqMtx$raw, decreasing = TRUE)
freqMtx <- freqMtx[freqOrder, ]

##   term raw relative
## 3    c  13   0.4815
## 4    e   7   0.2593
## 2    b   4   0.1481
## 1    a   3   0.1111
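
What all = TRUE does in the merge() call is easier to see when the two data frames do not contain the same keys. The following is a minimal sketch with made-up data (the data frames `left` and `right` and the result names are illustrative assumptions). In the lecture's case both data frames contain exactly the same words, so the option does not change the result there.

```r
# Two hypothetical data frames sharing the key column "word"
left <- data.frame(word = c("a", "b", "c"), raw = c(3, 4, 13))
right <- data.frame(word = c("b", "c", "d"), relative = c(0.2, 0.65, 0.15))

# all = TRUE keeps words found in either data frame; missing cells become NA
mergedAll <- merge(left, right, by = "word", all = TRUE)

# The default (all = FALSE) keeps only words present in both data frames
mergedBoth <- merge(left, right, by = "word")

nrow(mergedAll)   # 4 rows: a, b, c, d
nrow(mergedBoth)  # 2 rows: b, c
```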

Creating functions

Example of creating a function: squaring a number

User-defined function: calcSquare

Create calcSquare.R

calcSquare <- function(arg) {
    square <- arg^2
    return(square)
}

Load calcSquare.R

source("calcSquare.R")

calcSquare(3)

## [1] 9

User-defined function: getRawFreqMtx

Create getRawFreqMtx.R

getRawFreqMtx <- function(filename) {

    # Read the file line by line into a character vector
    txt <- readLines(filename, encoding = "UTF-8")

    # Split into words, flatten, lowercase, and drop empty strings
    wordLst <- strsplit(txt, "[[:space:]]|[[:punct:]]")
    wordLst <- unlist(wordLst)
    wordLst <- tolower(wordLst)
    wordLst <- wordLst[wordLst != ""]

    # Count each word and sort by raw frequency, highest first
    freq <- table(wordLst)
    freqData <- data.frame(freq)
    freqOrder <- order(freqData$Freq, decreasing = TRUE)
    freqData <- freqData[freqOrder, ]

    return(freqData)
}

Load getRawFreqMtx.R

source("getRawFreqMtx.R")

Example of running getRawFreqMtx

getRawFreqMtx("Data/test1.txt")

##   wordLst Freq
## 3       c   13
## 4       e    7
## 2       b    4
## 1       a    3

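Today's assignment (due October 28)

Submit the function file (getFreqMtx.R) by email.

Write a function getFreqMtx that takes a text file name as its argument and returns a table of each word's raw frequency and relative frequency.
Use "osaka-u.txt" to check that the function runs correctly.

Expected output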

getFreqMtx("Data/test1.txt")

##   term raw relative
## 3    c  13   0.4815
## 4    e   7   0.2593
## 2    b   4   0.1481
## 1    a   3   0.1111