Lecture 4: Sorting Data and Conditional Statements

[Previous assignment] getFreqMtx.R

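# Read a text file and return its word frequency table
getFreqMtx<-function(filename){
  txt<-readLines(filename)
  # Split each line on whitespace or punctuation
  wordLst<-strsplit(txt,"[[:space:]]|[[:punct:]]")
  ....
  
  return(freqMtx)
}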

Loading the function file

source("getFreqMtx.R")
freqMtx<-getFreqMtx("Lec04-text.txt")
head(freqMtx)

Sorting by frequency

freqMtx<-freqMtx[order(freqMtx$raw, decreasing = TRUE),]
head(freqMtx)
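To see what order() contributes to the sort above, here is a minimal sketch on a toy table. The data frame toyMtx, its words, and the column name relative are assumptions made for illustration; only the column raw appears in the lecture code.

```r
# Toy stand-in for freqMtx (hypothetical data): words as row names,
# raw counts in `raw`, and an assumed `relative` column.
toyRaw <- c(3, 12, 7, 9, 5)
toyMtx <- data.frame(raw = toyRaw,
                     relative = toyRaw / sum(toyRaw),
                     row.names = c("osaka", "the", "covid", "and", "with"))

# order() returns the row positions that would sort the column
idx <- order(toyMtx$raw, decreasing = TRUE)
print(idx)                  # 2 4 3 5 1

# Indexing the rows with those positions reorders the whole table
sortedMtx <- toyMtx[idx, ]
print(rownames(sortedMtx))  # "the" "and" "covid" "with" "osaka"
```

The comma in toyMtx[idx, ] matters: the positions go in the row slot, and the empty column slot keeps every column.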

Data size

dim(freqMtx)

## [1] 609   2

Extracting data (frequency of 5 or more)

smallMtx<- freqMtx[freqMtx$raw>=5,]

The View function

View(smallMtx)

Data size

dim(smallMtx)

## [1] 60  2
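The extraction works because the comparison produces one TRUE or FALSE per row, and only the TRUE rows are kept. A small sketch, again on a hypothetical toy table (toyMtx and the column name relative are assumptions, not the lecture data):

```r
# Toy stand-in for freqMtx (hypothetical data)
toyRaw <- c(3, 12, 7, 9, 5)
toyMtx <- data.frame(raw = toyRaw,
                     relative = toyRaw / sum(toyRaw),
                     row.names = c("osaka", "the", "covid", "and", "with"))

# The condition is a logical vector with one entry per row
keep <- toyMtx$raw >= 5
print(keep)           # FALSE TRUE TRUE TRUE TRUE

# sum() counts the TRUE entries, i.e. the number of rows that will remain
print(sum(keep))      # 4

# Rows where keep is TRUE are selected
smallToy <- toyMtx[keep, ]
print(dim(smallToy))  # 4 2
```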

Conditional extraction from the data frame

Conditional extraction: rows and columns

smallMtx[1,]      # first row
smallMtx[1:2,]    # rows 1 and 2
smallMtx[,2]      # second column
smallMtx[2,2]     # the value in row 2, column 2

Conditional extraction: raw frequency (at least 7 and less than 10)

smallMtx[smallMtx$raw >=7 & smallMtx$raw<10,]
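Two conditions are combined element by element with & (and) or | (or). The sketch below uses a hypothetical toy table (toyMtx and the column name relative are assumptions) to show which rows each combination keeps.

```r
# Toy stand-in for freqMtx (hypothetical data)
toyRaw <- c(3, 12, 7, 9, 5)
toyMtx <- data.frame(raw = toyRaw,
                     relative = toyRaw / sum(toyRaw),
                     row.names = c("osaka", "the", "covid", "and", "with"))

# & keeps rows where both conditions hold: 7 <= raw < 10
inRange <- toyMtx$raw >= 7 & toyMtx$raw < 10
print(rownames(toyMtx[inRange, ]))   # "covid" "and"

# | keeps rows where at least one condition holds: raw < 5 or raw >= 10
outside <- toyMtx$raw < 5 | toyMtx$raw >= 10
print(rownames(toyMtx[outside, ]))   # "osaka" "the"
```

Use the single-character forms & and | for row selection; && and || are meant for single TRUE/FALSE values such as the condition of an if statement.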

Conditional extraction: words

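# A single word
smallMtx[rownames(smallMtx)=="covid",]

# Caution: == compares element by element (recycling the shorter vector), so this is not a membership test
smallMtx[rownames(smallMtx)==c("osaka","covid"),]

# Use %in% to select rows whose name matches any of several words
smallMtx[rownames(smallMtx) %in% c("osaka","covid"),]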
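The difference between == and %in% with several target words is easiest to see on a plain character vector. The vectors words and targets below are made up for illustration.

```r
# Hypothetical word list and search targets
words   <- c("covid", "osaka", "the", "and")
targets <- c("osaka", "covid")

# == pairs the elements up position by position, recycling `targets`:
# "covid" vs "osaka", "osaka" vs "covid", "the" vs "osaka", "and" vs "covid"
eqMask <- words == targets
print(eqMask)   # FALSE FALSE FALSE FALSE

# %in% asks, for each word, whether it appears anywhere in `targets`
inMask <- words %in% targets
print(inMask)   # TRUE TRUE FALSE FALSE
```

Both target words are present in words, yet == finds neither because they sit in the opposite order.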

Conditional extraction: word length

smallMtx[nchar(rownames(smallMtx))==3,]
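nchar() returns the number of characters of each string, so comparing its result with 3 gives the usual logical mask. A small sketch on a made-up word vector:

```r
# Hypothetical word list
words <- c("osaka", "the", "covid", "and", "with")

# One length per word
lens <- nchar(words)
print(lens)               # 5 3 5 3 4

# Keep only the three-letter words
print(words[lens == 3])   # "the" "and"
```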

Word search with regular expressions

The stringr package

library(stringr)

Words containing “th”

smallMtx[str_detect(rownames(smallMtx), "th"),]

Words beginning with “th”

smallMtx[str_detect(rownames(smallMtx), "^th"),]

Words ending with “th”

smallMtx[str_detect(rownames(smallMtx), "th$"),]

Words beginning with “a”

smallMtx[str_detect(rownames(smallMtx), "^a"),]
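The anchors ^ (start of the word) and $ (end of the word) can be compared side by side on a made-up word vector. This sketch uses base R's grepl(pattern, x) so that it runs without extra packages; for these simple patterns it should select the same words as str_detect(x, pattern), with the argument order reversed.

```r
# Hypothetical word list
words <- c("the", "with", "think", "bath", "and", "health")

print(words[grepl("th",  words)])   # contains "th":    "the" "with" "think" "bath" "health"
print(words[grepl("^th", words)])   # starts with "th": "the" "think"
print(words[grepl("th$", words)])   # ends with "th":   "with" "bath" "health"
print(words[grepl("^a",  words)])   # starts with "a":  "and"
```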

Multiple conditions

Words beginning with “a” or (|) ending with “th”

smallMtx[str_detect(rownames(smallMtx), "^a|th$"),]

Words beginning with “t” and ending with “th”

smallMtx[str_detect(rownames(smallMtx), "^t.+th$"),]

#smallMtx[str_detect(rownames(smallMtx), "^t\\w+th$"),]
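The three patterns above differ in small ways that a made-up word list makes visible. As an assumption for portability, the sketch uses base R's grepl() instead of str_detect(); for these patterns the two should agree.

```r
# Hypothetical word list, including edge cases
words <- c("and", "bath", "truth", "teeth", "th", "t-th", "the")

# "^a|th$" reads as (starts with "a") OR (ends with "th")
print(words[grepl("^a|th$", words)])
# "and" "bath" "truth" "teeth" "th" "t-th"

# "^t.+th$": "t", then at least one character of any kind, then "th"
print(words[grepl("^t.+th$", words)])
# "truth" "teeth" "t-th"

# "^t\\w+th$": the middle part must consist of word characters only
print(words[grepl("^t\\w+th$", words)])
# "truth" "teeth"
```

The word "th" fails both of the last two patterns because + requires at least one character between the leading "t" and the final "th"; "t-th" passes only the version with ., since a hyphen is not a word character.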

Conditional statements: if

calcSQRT

calcSQRT<- function(value) {
    return(sqrt(value))
}

calc2ndPower

calc2ndPower<- function(value) {
  return (value^2)
}

Execution

tmp <- 100
calcSQRT(tmp)

## [1] 10

calc2ndPower(tmp)

## [1] 10000

Conditional statement 1

Branch condition: a category value

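calcTest1 <- function(value, type=1){
  if(type==1){
      ans <- calcSQRT(value)      # type 1: square root
  }else if(type==2){
      ans <- calc2ndPower(value)  # type 2: square
  }else{
      stop("type must be 1 or 2") # otherwise ans would be undefined
  }
  return(ans)
}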

Execution

calcTest1(100)

## [1] 10

calcTest1(100,2)

## [1] 10000
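When the branch is chosen by a category value, R's switch() is a possible alternative to a chain of if / else if. The following is only a sketch of that alternative, not the form used in the lecture; the name calcTestSwitch is hypothetical, and the two helper functions are repeated so the snippet runs on its own.

```r
calcSQRT <- function(value) {
  return(sqrt(value))
}

calc2ndPower <- function(value) {
  return(value^2)
}

# Hypothetical variant of calcTest1 that dispatches with switch()
calcTestSwitch <- function(value, type = 1) {
  switch(as.character(type),            # switch() matches on the name
         "1" = calcSQRT(value),
         "2" = calc2ndPower(value),
         stop("type must be 1 or 2"))   # unnamed last entry is the default
}

print(calcTestSwitch(100))      # 10
print(calcTestSwitch(100, 2))   # 10000
```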

Conditional statement 2

Branch condition: a logical value

calcTest2 <- function(value, sqrtFlag=FALSE){
  if(sqrtFlag==TRUE){
      ans = calcSQRT(value)
  }else{
      ans = calc2ndPower(value)
  }
  return(ans)
}

Execution

calcTest2(100)

## [1] 10000

calcTest2(100, sqrtFlag=TRUE)

## [1] 10

(Review) Bar chart

tmp <- smallMtx[1:20,1]
names(tmp) <-rownames(smallMtx)[1:20]
barplot(tmp,las=3)

(Review) Colored bar chart

color8 = c("red", "violet", "pink", "orange", "yellow", "green", "blue", "cyan") 
barplot(tmp, las=1,col=color8)

manipulate package

Interactive plots

library(manipulate)

Choosing a color

The picker() function

manipulate(plot(0,0,pch=8,cex=5,col=myColors), myColors=picker("red", "violet", "pink", "orange", "yellow", "green", "blue", "cyan") )

Choosing a plot marker

The picker() function

manipulate(
  plot(0,0,pch=myMarkers,cex=5,col=myColors),
  myColors=picker("red", "violet", "pink", "orange", "yellow", "green", "blue", "cyan",initial="violet"),
  myMarkers=picker(1,2,3,4,5,6,7,8,initial="5")
)

Choosing the plot size

The slider() function

manipulate(
  plot(0,0,pch=8,cex=mySize,col="blue"),
  mySize=slider(1,10,initial=5)
)

How to write a manipulate call (multi-line case)

manipulate(
  {
  script spanning multiple lines
  },
  picker and slider settings (separate several controls with commas)
)
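A concrete instance of the template might look like the sketch below. The helper makeTitle and the control names mySize and myColor are hypothetical. Because manipulate is designed to run inside RStudio, the call is guarded; the guard assumes RStudio sets the environment variable RSTUDIO to "1".

```r
# Hypothetical helper that builds the plot title from the control values
makeTitle <- function(size, color) {
  paste0("cex = ", size, ", col = ", color)
}

# Guard (assumption: RStudio sets RSTUDIO=1); elsewhere the call is skipped
if (Sys.getenv("RSTUDIO") == "1" && requireNamespace("manipulate", quietly = TRUE)) {
  library(manipulate)
  manipulate(
    {
      # Several statements are grouped inside the braces
      plot(0, 0, pch = 8, cex = mySize, col = myColor)
      title(main = makeTitle(mySize, myColor))
    },
    # Controls follow the braces, separated by commas
    mySize  = slider(1, 10, initial = 5),
    myColor = picker("red", "green", "blue", initial = "blue")
  )
}
```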

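Exercises

Exercise 1: Using the data stored in the variable “freqMtx”, perform the following conditional extractions.

Exercise 1-1: Data with a relative frequency of at least 0.01 and at most 0.04

Exercise 1-2: Words beginning with “c” and ending with “ed”

Exercise 2:

Using the manipulate function, make the arguments las and col of the (review) bar chart interactive.

Note: choose the value of las from 1, 2, and 3.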

tmp <- smallMtx[1:20,1]
names(tmp) <-rownames(smallMtx)[1:20]
color8 = c("red", "violet", "pink", "orange", "yellow", "green", "blue", "cyan") 
barplot(tmp, las=1,col=color8)

Example output (figure)

If you have time…

Zipf’s law

\[Frequency=\frac{K}{Rank^A}\] where K and A are constants

K=freqMtx[1,1]
A=0.8

rank <- seq(1:dim(freqMtx)[1])
zipf <- K/rank^A
head(zipf)

## [1] 90.00000 51.69143 37.37193 29.68893 24.83513 21.46454
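The printed values can be checked by hand from the formula. The sketch below assumes K = 90, inferred from the first printed value (rank 1 gives K itself), instead of reading it from freqMtx.

```r
K <- 90     # assumption: inferred from the first value printed above
A <- 0.8

# First six ranks only
rank <- 1:6
zipf <- K / rank^A
print(zipf)

# Rank 2 by hand: 90 divided by 2 to the power 0.8
print(90 / 2^0.8)
```

The frequency predicted for rank 2 is a little over half of the top frequency; with A = 1 it would be exactly half.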

Plot: the theoretical Zipf curve

plot(zipf, log="xy", type="l",col="red" ,
xlim=c(1,nrow(freqMtx)),ylim=c(1,50),main="Zipf's Law", xlab="Rank", ylab="Frequency")
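Taking logarithms of the formula gives log(Frequency) = log(K) - A * log(Rank), which is why the theoretical curve appears as a straight line with slope -A when both axes are logarithmic (log="xy"). A quick numerical check, assuming K = 90 and A = 0.8 as illustrative values:

```r
K <- 90     # assumed value for illustration
A <- 0.8
rank <- 1:10
zipf <- K / rank^A

# Slope between neighbouring points on the log-log scale
slopes <- diff(log(zipf)) / diff(log(rank))
print(slopes)   # every entry equals -A up to rounding
```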

Overlaying the frequency scatter plot on the theoretical Zipf curve

par(new=TRUE)  # draw the next plot on top of the current one
plot(rank,freqMtx[,1], xlim=c(1,nrow(freqMtx)), ylim=c(1,50),log="xy",pch=8, col="darkgreen", main="Zipf's Law", xlab="Rank", ylab="Frequency")

Adding a legend: legend

Position: “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right”, “center”
Labels: the legend text
lty: line type
pch: plotting symbol

legend("topright",c("Frequency","Zipf's law"),lty=c(NA,1),pch=c(8,NA),col=c("darkgreen","red"))

Plot

Changing the constants K and A

K=50
A=0.5
rank <- seq(1:dim(freqMtx)[1])
zipf <- K/rank^A

Corpus Linguistics B: Lecture04 (Fall 2020)
