Read the file one line at a time and store the lines in a character vector
txt <- readLines("Data/osaka-u.txt", encoding = "UTF-8")
txt[1]
## [1] "\"Osaka Imperial University\" was founded in 1931 thanks to the enthusiastic support of the citizens of Osaka and persons involved in the university with two schools, Medicine and Science. However, our university's roots reach back to Tekijuku, a private \"place of learning\" founded by the doctor and scholar of Western sciences OGATA Koan at the end of the Edo period. Tekijuku's open academic culture and forward-looking spirit eventually gave birth to Osaka Prefecture Medical School and, eventually, to today's Osaka University."
length(txt)
## [1] 21
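# Split each line on whitespace and punctuation
wordLst <- strsplit(txt, "[[:space:]]|[[:punct:]]")
# Flatten the per-line list into a single character vector
wordLst <- unlist(wordLst)
wordLst <- tolower(wordLst)
# Drop the empty strings left behind by consecutive delimiters
wordLst <- wordLst[nchar(wordLst) > 0]
# Equivalent filter; it changes nothing after the previous line
wordLst <- wordLst[wordLst != ""]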
tokens <- length(wordLst)
tokens
## [1] 689
types <- length(unique(wordLst))
types
## [1] 305
\[ \mathrm{TTR} = \frac{\text{types}}{\text{tokens}} \times 100 \]
types/tokens * 100
## [1] 44.27
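The tokenization and TTR steps above can be tried on a small inline text, which makes it easier to see where the empty strings come from. The following is a minimal sketch; the sample sentences and the `demo*` names are made up for illustration and are not part of the original data.

```r
# Hypothetical two-line sample standing in for the lines returned by readLines()
demoTxt <- c("The cat sat.", "The dog, the cat!")

# Split on whitespace and punctuation, then flatten into one vector
demoWords <- unlist(strsplit(demoTxt, "[[:space:]]|[[:punct:]]"))
demoWords <- tolower(demoWords)

# "dog, the" has two delimiters in a row, which leaves an empty string behind
demoWords <- demoWords[demoWords != ""]

demoTokens <- length(demoWords)         # running words
demoTypes <- length(unique(demoWords))  # distinct words
demoTTR <- demoTypes / demoTokens * 100
demoTTR
```

With this sample, seven tokens and four types remain after filtering, so the ratio is about 57.1.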
freqLst <- sort(table(wordLst), decreasing = TRUE)
freqLst[1:5]
## wordLst
## the and of university osaka
## 41 33 30 28 23
subfreq <- freqLst[1:10]
title <- "Word Frequency Distribution"
xlabel <- "Word"
ylabel <- "Frequency"
barplot(subfreq, main = title, xlab = xlabel, ylab = ylabel, las = 3)
# Named barColors so the variable does not share a name with the colors() function
barColors <- c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan")
barplot(subfreq, col = barColors, main = title, xlab = xlabel, ylab = ylabel, las = 3)
colors()[1:10]
## [1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"
## [9] "aquamarine1" "aquamarine2"
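Note that `subfreq` holds ten words while the color vector passed to `barplot()` has only eight entries. Base graphics should recycle the `col` vector in that case, so the ninth and tenth bars reuse the first two colors. A minimal sketch that makes the recycling explicit with `rep_len()` (`barColors8` and `recycled` are illustrative names):

```r
barColors8 <- c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan")

# Extend the eight colors to ten bars by cycling from the start
recycled <- rep_len(barColors8, 10)
recycled[9:10]

# Check that every chosen name is one of the built-in color names
all(barColors8 %in% colors())
```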
Interactive plots
library(manipulate)  # manipulate works only inside RStudio
The picker() function
manipulate(barplot(subfreq, col = myColors, main = title, xlab = xlabel, ylab = ylabel,
las = 3), myColors = picker("red", "yellow", "green", "violet", "orange",
"blue", "pink", "cyan"))
manipulate(barplot(freqLst, col = myColors, main = title, xlab = xlabel, ylab = ylabel,
xlim = c(0, x.max), las = 3), myColors = picker("red", "yellow", "green",
"violet", "orange", "blue", "pink", "cyan"), x.max = slider(5, 300, initial = 10))
Read the file one line at a time and store the lines in a character vector
txt <- readLines("Data/test1.txt")
wordLst <- strsplit(txt, "[[:space:]]|[[:punct:]]")
wordLst <- unlist(wordLst)
freq <- sort(table(wordLst), decreasing = TRUE)
freq
## wordLst
## c e b a
## 13 7 4 3
Relative frequency: the share of each word when the total is taken as 1
relative <- freq/sum(freq)
sum(relative)
## [1] 1
round(relative, 2)
## wordLst
## c e b a
## 0.48 0.26 0.15 0.11
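Base R also provides `prop.table()`, which performs the same division by the total. The sketch below uses a small made-up vector (`demoWords`) rather than `Data/test1.txt`, so its counts differ from the output above.

```r
demoWords <- c("c", "c", "c", "e", "e", "a")
demoFreq <- sort(table(demoWords), decreasing = TRUE)

# Manual version, as in the text: divide each count by the total
manual <- demoFreq / sum(demoFreq)

# Built-in equivalent
builtin <- prop.table(demoFreq)

all.equal(as.vector(manual), as.vector(builtin))
round(100 * builtin, 1)  # the same proportions expressed as percentages
```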
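# names() and as.vector() keep each data frame to two columns even when freq is a table object
freqData <- data.frame(word = names(freq), freq = as.vector(freq), row.names = names(freq))
relativeData <- data.frame(word = names(relative), freq = as.vector(relative), row.names = names(relative))
freqData
##   word freq
## c    c   13
## e    e    7
## b    b    4
## a    a    3
relativeData
##   word   freq
## c    c 0.4815
## e    e 0.2593
## b    b 0.1481
## a    a 0.1111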
#### Joining two data frames (merge)
```r
# Join the two data frames on the shared "word" column
freqMtx <- merge(freqData, relativeData, all = TRUE, by = "word")
names(freqMtx) <- c("term", "raw", "relative")
# Sort the rows by raw frequency, highest first
freqOrder <- order(freqMtx$raw, decreasing = TRUE)
freqMtx <- freqMtx[freqOrder, ]
freqMtx
##   term raw relative
## 3    c  13   0.4815
## 4    e   7   0.2593
## 2    b   4   0.1481
## 1    a   3   0.1111
```
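In the example above both data frames contain exactly the same words, so `all = TRUE` makes no visible difference. The following sketch, with made-up data frames `left` and `right`, shows what the argument changes when the keys only partly overlap.

```r
# Hypothetical inputs: "a" appears only in left, "d" only in right
left <- data.frame(word = c("a", "b", "c"), raw = c(3, 4, 13))
right <- data.frame(word = c("b", "c", "d"), relative = c(0.2, 0.65, 0.15))

# Default: keep only the words present in both data frames
inner <- merge(left, right, by = "word")

# all = TRUE: keep every word and fill the missing cells with NA
outer <- merge(left, right, by = "word", all = TRUE)

inner
outer
```

Here `inner` has rows for `b` and `c` only, while `outer` has four rows, with `NA` in `relative` for `a` and in `raw` for `d`.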
Create calcSquare.R
calcSquare <- function(arg) {
  square <- arg^2
  return(square)
}
source("calcSquare.R")
calcSquare(3)
## [1] 9
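`source()` reads a script file and evaluates it, which is what makes `calcSquare` available in the session. A self-contained sketch of the round trip follows; it writes the script to a temporary directory rather than the working directory, an assumption made here so that the example does not overwrite any existing file.

```r
# Write the function definition to a temporary script file
scriptPath <- file.path(tempdir(), "calcSquare.R")
writeLines(c(
  "calcSquare <- function(arg) {",
  "  square <- arg^2",
  "  return(square)",
  "}"
), scriptPath)

# Evaluate the script; calcSquare is defined afterwards
source(scriptPath)

calcSquare(3)
calcSquare(1:4)  # ^ is vectorized, so each element is squared
```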
Create getRawFreqMtx.R
getRawFreqMtx <- function(filename) {
  # Read the file and split every line into lower-case words
  txt <- readLines(filename, encoding = "UTF-8")
  wordLst <- strsplit(txt, "[[:space:]]|[[:punct:]]")
  wordLst <- unlist(wordLst)
  wordLst <- tolower(wordLst)
  wordLst <- wordLst[wordLst != ""]
  # Count the words and sort by frequency, highest first
  freq <- table(wordLst)
  freqData <- data.frame(freq)
  freqOrder <- order(freqData$Freq, decreasing = TRUE)
  freqData <- freqData[freqOrder, ]
  return(freqData)
}
source("getRawFreqMtx.R")
getRawFreqMtx("Data/test1.txt")
##   wordLst Freq
## 3       c   13
## 4       e    7
## 2       b    4
## 1       a    3
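Exercise: write a function getFreqMtx that takes a text file name as its argument and outputs a table of each word's raw frequency and relative frequency.
Use "osaka-u.txt" to check that the function runs correctly.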
getFreqMtx("Data/test1.txt")
##   term raw relative
## 3    c  13   0.4815
## 4    e   7   0.2593
## 2    b   4   0.1481
## 1    a   3   0.1111
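One possible approach to this exercise is sketched below. It reuses the tokenization from `getRawFreqMtx` and adds the relative-frequency column directly instead of calling `merge()`. That shortcut, the column names copied from the expected output, and the temporary demo file are choices made for this sketch; other solutions are equally valid.

```r
getFreqMtx <- function(filename) {
  # Same tokenization as getRawFreqMtx
  txt <- readLines(filename, encoding = "UTF-8")
  wordLst <- unlist(strsplit(txt, "[[:space:]]|[[:punct:]]"))
  wordLst <- tolower(wordLst)
  wordLst <- wordLst[wordLst != ""]

  # Count each word; as.data.frame() yields one row per word
  freqMtx <- as.data.frame(table(wordLst), stringsAsFactors = FALSE)
  names(freqMtx) <- c("term", "raw")

  # Relative frequency: share of all tokens, summing to 1
  freqMtx$relative <- freqMtx$raw / sum(freqMtx$raw)

  # Sort by raw frequency, highest first
  freqMtx[order(freqMtx$raw, decreasing = TRUE), ]
}

# Hypothetical demo file so the sketch runs without the Data/ directory
demoFile <- tempfile(fileext = ".txt")
writeLines(c("c c a", "b c e e"), demoFile)
demoMtx <- getFreqMtx(demoFile)
demoMtx
```

Calling `getFreqMtx("Data/osaka-u.txt")` in the same way is how the exercise asks for the function to be checked.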