Warmup Practice

Check the working directory

getwd()
[1] "/cloud/project"

Create a vector

c(1, 2, 3, 4, 5)
[1] 1 2 3 4 5

Operation: multiply by 2

c(1, 2, 3, 4, 5)*2
[1]  2  4  6  8 10

Assign to a variable

Assignment operator (<-)

Y <- c(1, 2, 3, 4, 5)

Basic operation: multiply by 2

Y*2 
[1]  2  4  6  8 10

Basic operation: square

Y^2
[1]  1  4  9 16 25

Extract an element

Y[4]
[1] 4
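
Indexing in R starts at 1 and accepts more than a single position. The following is a minimal sketch of the common forms; the variable names are illustrative and not part of the original notebook. The logical form is the same mechanism used further down to drop empty strings with wordLst[wordLst != ""].

```r
Y <- c(1, 2, 3, 4, 5)

first_and_third <- Y[c(1, 3)]  # pick several positions at once
middle <- Y[2:4]               # a range of positions
without_first <- Y[-1]         # a negative index drops that position
large <- Y[Y > 2]              # a logical test keeps elements where it is TRUE

print(first_and_third)
print(middle)
print(without_first)
print(large)
```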

length function: length of a vector (number of elements)

str <- c("a", "ab", "abc")
length(str)
[1] 3

nchar function: number of characters in each string

nchar(str)
[1] 1 2 3

sqrt function: compute the square root

numLst <- c(16, 25, 256)
sqrt(numLst)
[1]  4  5 16

Building a word frequency table from text

Sample text

Read the text file

Read the file line by line and store the lines in a character vector

txt<-readLines("sample_texts/sample_en.txt")

Print the result

txt
[1] "COVID-19 is an infectious disease caused by a coronavirus called SARS-CoV-2. "                                                                                         
[2] "It mainly causes symptoms such as fever and/or cough. In general, it is spread through droplet and contact transmission. "                                             
[3] "It has been pointed out that it may spread before symptoms appear. "                                                                                                   
[4] "It is therefore important to habitually follow the general strategies for preventing infectious diseases, such as social distancing and wearing a mask when in public."

Contents of line 3

txt[3] 
[1] "It has been pointed out that it may spread before symptoms appear. "

Practice: display the number of lines read from the file

[1] 4

Split on whitespace and punctuation

Punctuation characters matched by [[:punct:]]:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

wordLst<-strsplit(txt,"[[:space:]]|[[:punct:]]")

Print the result

wordLst
[[1]]
 [1] "COVID"       "19"          "is"         
 [4] "an"          "infectious"  "disease"    
 [7] "caused"      "by"          "a"          
[10] "coronavirus" "called"      "SARS"       
[13] "CoV"         "2"           ""           

[[2]]
 [1] "It"           "mainly"       "causes"      
 [4] "symptoms"     "such"         "as"          
 [7] "fever"        "and"          "or"          
[10] "cough"        ""             "In"          
[13] "general"      ""             "it"          
[16] "is"           "spread"       "through"     
[19] "droplet"      "and"          "contact"     
[22] "transmission" ""            

[[3]]
 [1] "It"       "has"      "been"     "pointed" 
 [5] "out"      "that"     "it"       "may"     
 [9] "spread"   "before"   "symptoms" "appear"  
[13] ""        

[[4]]
 [1] "It"         "is"         "therefore" 
 [4] "important"  "to"         "habitually"
 [7] "follow"     "the"        "general"   
[10] "strategies" "for"        "preventing"
[13] "infectious" "diseases"   ""          
[16] "such"       "as"         "social"    
[19] "distancing" "and"        "wearing"   
[22] "a"          "mask"       "when"      
[25] "in"         "public"    
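
The empty strings ("") in the output above come from adjacent separators: the pattern matches one whitespace or punctuation character at a time, so a comma or period followed by a space leaves nothing between the two matches. The sketch below reproduces this on a short made-up string (not taken from the sample file) and shows one possible variation, assumed here rather than used in the notebook, that treats a run of separators as a single delimiter.

```r
# A short made-up string containing ", " and ". "
demo <- "fever, cough. In general"

# One separator character per match: an empty string appears between
# the punctuation mark and the space that follows it.
parts <- strsplit(demo, "[[:space:]]|[[:punct:]]")[[1]]
print(parts)

# Variation (an assumption, not the notebook's approach): "+" makes a run
# of separators count as one delimiter, so no empty strings appear here.
parts_plus <- strsplit(demo, "[[:space:][:punct:]]+")[[1]]
print(parts_plus)
```

A string that starts with a separator would still yield a leading empty string even with the "+" form, so filtering out "" afterwards remains a reasonable safeguard.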

Flatten the per-line results into a single vector

wordLst<-unlist(wordLst)

Convert to lowercase

wordLst<-tolower(wordLst)

Print the result

wordLst
 [1] "covid"        "19"           "is"          
 [4] "an"           "infectious"   "disease"     
 [7] "caused"       "by"           "a"           
[10] "coronavirus"  "called"       "sars"        
[13] "cov"          "2"            ""            
[16] "it"           "mainly"       "causes"      
[19] "symptoms"     "such"         "as"          
[22] "fever"        "and"          "or"          
[25] "cough"        ""             "in"          
[28] "general"      ""             "it"          
[31] "is"           "spread"       "through"     
[34] "droplet"      "and"          "contact"     
[37] "transmission" ""             "it"          
[40] "has"          "been"         "pointed"     
[43] "out"          "that"         "it"          
[46] "may"          "spread"       "before"      
[49] "symptoms"     "appear"       ""            
[52] "it"           "is"           "therefore"   
[55] "important"    "to"           "habitually"  
[58] "follow"       "the"          "general"     
[61] "strategies"   "for"          "preventing"  
[64] "infectious"   "diseases"     ""            
[67] "such"         "as"           "social"      
[70] "distancing"   "and"          "wearing"     
[73] "a"            "mask"         "when"        
[76] "in"           "public"      

Remove empty strings ("")

# Alternative: wordLst <- wordLst[nchar(wordLst) > 0]
wordLst <- wordLst[wordLst != ""]

Print the result

wordLst
 [1] "covid"        "19"           "is"          
 [4] "an"           "infectious"   "disease"     
 [7] "caused"       "by"           "a"           
[10] "coronavirus"  "called"       "sars"        
[13] "cov"          "2"            "it"          
[16] "mainly"       "causes"       "symptoms"    
[19] "such"         "as"           "fever"       
[22] "and"          "or"           "cough"       
[25] "in"           "general"      "it"          
[28] "is"           "spread"       "through"     
[31] "droplet"      "and"          "contact"     
[34] "transmission" "it"           "has"         
[37] "been"         "pointed"      "out"         
[40] "that"         "it"           "may"         
[43] "spread"       "before"       "symptoms"    
[46] "appear"       "it"           "is"          
[49] "therefore"    "important"    "to"          
[52] "habitually"   "follow"       "the"         
[55] "general"      "strategies"   "for"         
[58] "preventing"   "infectious"   "diseases"    
[61] "such"         "as"           "social"      
[64] "distancing"   "and"          "wearing"     
[67] "a"            "mask"         "when"        
[70] "in"           "public"      

Number of word tokens

tokens <- length(wordLst)

Number of word types

  • The unique() function returns the distinct elements of a vector, so its length is the number of types
types <- length(unique(wordLst))

Print the result

print(paste("Token =", tokens))
[1] "Token = 71"
print(paste("Types =", types))
[1] "Types = 55"
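
To make the token/type distinction concrete, here is a small sketch on a made-up six-word vector (the names demo, tokens_demo and types_demo are illustrative only). Every occurrence counts as a token, while each distinct word counts once as a type.

```r
demo <- c("it", "is", "it", "a", "is", "it")

tokens_demo <- length(demo)          # every occurrence counts
distinct_words <- unique(demo)       # duplicates removed, first-seen order kept
types_demo <- length(distinct_words) # each distinct word counts once

print(tokens_demo)
print(distinct_words)
print(types_demo)
```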

TTR: computing the Type-Token Ratio

\[TTR=\frac{types}{tokens} \times 100 \]

types/tokens*100
[1] 77.46479

Print the result rounded to two decimal places

round(types/tokens*100,2)
[1] 77.46

Practice: compute Guiraud's index (RTTR: Root Type-Token Ratio)

\[RTTR=\frac{types}{\sqrt{tokens}} \]

Print the result rounded to two decimal places

[1] 6.53

Word Frequencies

(freq <- table(wordLst))
wordLst
          19            2            a 
           1            1            2 
          an          and       appear 
           1            3            1 
          as         been       before 
           2            1            1 
          by       called       caused 
           1            1            1 
      causes      contact  coronavirus 
           1            1            1 
       cough          cov        covid 
           1            1            1 
     disease     diseases   distancing 
           1            1            1 
     droplet        fever       follow 
           1            1            1 
         for      general   habitually 
           1            2            1 
         has    important           in 
           1            1            2 
  infectious           is           it 
           2            3            5 
      mainly         mask          may 
           1            1            1 
          or          out      pointed 
           1            1            1 
  preventing       public         sars 
           1            1            1 
      social       spread   strategies 
           1            2            1 
        such     symptoms         that 
           2            2            1 
         the    therefore      through 
           1            1            1 
          to transmission      wearing 
           1            1            1 
        when 
           1 

Sort

(freq_data<-sort(freq, decreasing=TRUE))
wordLst
          it          and           is 
           5            3            3 
           a           as      general 
           2            2            2 
          in   infectious       spread 
           2            2            2 
        such     symptoms           19 
           2            2            1 
           2           an       appear 
           1            1            1 
        been       before           by 
           1            1            1 
      called       caused       causes 
           1            1            1 
     contact  coronavirus        cough 
           1            1            1 
         cov        covid      disease 
           1            1            1 
    diseases   distancing      droplet 
           1            1            1 
       fever       follow          for 
           1            1            1 
  habitually          has    important 
           1            1            1 
      mainly         mask          may 
           1            1            1 
          or          out      pointed 
           1            1            1 
  preventing       public         sars 
           1            1            1 
      social   strategies         that 
           1            1            1 
         the    therefore      through 
           1            1            1 
          to transmission      wearing 
           1            1            1 
        when 
           1 

Write to a file

write.csv(freq_data, "freq_en.csv")
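
When given a one-dimensional table, write.csv writes it as a data frame with one column for the words and a Freq column for the counts, plus a column of row names by default. The sketch below, which uses a made-up word vector and a temporary file so nothing is written into the project, shows a round trip; passing row.names = FALSE is an optional choice made here, not something the notebook does.

```r
# Made-up data; the names words, freq_demo, out_path and back are illustrative.
words <- c("it", "is", "it", "a", "is", "it")
freq_demo <- sort(table(words), decreasing = TRUE)

# Write to a temporary file instead of the working directory.
out_path <- tempfile(fileext = ".csv")
write.csv(freq_demo, out_path, row.names = FALSE)

# Read the file back to check what was stored.
back <- read.csv(out_path)
print(back)
```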

Word frequency distribution (colored)

las: axis label style (las=3 draws the axis labels vertically)

colors <- c("orange", "lightblue", "green")
barplot(freq_data, las=3, col=colors)

Build word frequency data from a Japanese text (sample_ja_1.txt)

Read the text file

Read the file line by line and store the lines in a character vector

txt<-readLines("sample_texts/sample_ja_1.txt")

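Frequency table (same split as for the English text; this works only because the words in the file are already separated by spaces, since strsplit performs no word segmentation)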

wordLst<-strsplit(txt,"[[:space:]]|[[:punct:]]")
wordLst<-unlist(wordLst)
wordLst<- wordLst[wordLst != ""]
freq <- table(wordLst)
(freq_data<-sort(freq, decreasing=TRUE))
wordLst
        に         を         と         が 
        12          8          7          6 
        の         は   ください         て 
         6          6          5          5 
        で       とき       ない       ます 
         5          5          5          5 
         1          9      COVID   ウイルス 
         4          4          4          4 
    コロナ         話         か         し 
         4          4          3          3 
        や         中         人         体 
         3          3          3          3 
         2        CoV       SARS       つけ 
         2          2          2          2 
      など       なり     マスク     まわり 
         2          2          2          2 
      よう         出       出る         咳 
         2          2          2          2 
        外       感染         熱         誰 
         2          2          2          2 
      近く distancing     Social       あと 
         2          1          1          1 
      あり       ある       いい     うつさ 
         1          1          1          1 
    うつし       から       こと     しかし 
         1          1          1          1 
    しまう       しれ     そして         た 
         1          1          1          1 
        だ     つける       なら       なる 
         1          1          1          1 
      ませ         も         ん       入っ 
         1          1          1          1 
      入る       接触         気       空け 
         1          1          1          1 
      話す         間       飛沫 
         1          1          1 

TTR (two decimal places)

tokens <- length(wordLst)
types <- length(unique(wordLst))
round(types/tokens*100,2)
[1] 39.23
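
Because the English and Japanese runs repeat the same steps, they could be wrapped in functions. The following is a hedged sketch only: word_freq and ttr are hypothetical helper names that do not appear in the original notebook, and the demo lines are made up.

```r
# Hypothetical helper: split lines into words and return a sorted frequency table.
word_freq <- function(lines, lowercase = FALSE) {
  words <- unlist(strsplit(lines, "[[:space:]]|[[:punct:]]"))
  if (lowercase) words <- tolower(words)
  words <- words[words != ""]            # drop empty strings left by the split
  sort(table(words), decreasing = TRUE)
}

# Hypothetical helper: TTR as a percentage, rounded to two decimal places.
ttr <- function(freq) {
  types <- length(freq)   # one table entry per distinct word
  tokens <- sum(freq)     # the counts add up to the number of tokens
  round(types / tokens * 100, 2)
}

demo_lines <- c("It is a test.", "It is only a test, it is.")
freq_demo <- word_freq(demo_lines, lowercase = TRUE)
print(freq_demo)
print(ttr(freq_demo))
```

With the sample files in place, a call such as word_freq(readLines("sample_texts/sample_ja_1.txt")) would mirror the steps shown above for the Japanese text.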

Exercise: using the Japanese text (sample_ja_2.txt), do the following:

  • Build a word frequency table
  • Compute the TTR and Guiraud's index

Word frequency table

TTR (two decimal places)

Guiraud's index (two decimal places)