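stringr Functions II
Understand the role of regular expressions in text preprocessing
Understand the difference between literal characters and metacharacters
Understand how to escape metacharacters
Identify character sets (and how to build ranges)
Practice building regular expressions with character classes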

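In this session, we practice text preprocessing and tokenization using the stringr functions and regular expressions covered so far.

Re-analyzing word frequencies in the Wikipedia text

Using the regular expressions we have learned so far, let's process the text of the Wikipedia page from the previous session once more.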

library(pdftools)
library(stringr)
bts_text <- pdf_text("BTS_(band).pdf")
bts_string <- str_c(bts_text, collapse = " ") # Join the character vector bts_text into a single string

We now have a single string that holds the full text of the Wikipedia page on BTS.

Let's preprocess the string.

First, let's remove everything in the References section.

str_locate_all(tolower(bts_string), "references") # Find the positions of the pattern "references" in the string.

## [[1]]
##      start   end
## [1,]  7027  7036
## [2,] 32170 32179

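bts_trunc <- str_trunc(bts_string, 32169, side = "right") # Shorten the string: text to the right of this width is dropped (str_trunc ends the result with "..." by default).

Here we find where the pattern "references" occurs in the string bts_string, then truncate the string by removing everything from the position of the second match (the References section heading) onward.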
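The locate-then-cut step can also be sketched with `str_sub()`, which keeps an exact character range and adds no ellipsis. The string `toy` below is a hypothetical stand-in for `bts_string`, and the sketch assumes the last match is the section heading.

```r
library(stringr)

# Hypothetical stand-in for bts_string (not the real page text)
toy <- "Intro text. See references in the body. More text. References [1] Some source"

# Locate every case-insensitive occurrence of "references"
loc <- str_locate_all(tolower(toy), "references")[[1]]

# Assumption: the last match is the References section heading
cut_at <- loc[nrow(loc), "start"]

# Keep everything before that position
toy_trunc <- str_sub(toy, 1, cut_at - 1)
toy_trunc
```

Reading the cut position from the match matrix avoids hard-coding a number such as 32169, which changes whenever the PDF changes.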

If the `str_trunc` function raises an error

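bts_string_line <- unlist(str_split(bts_string, "\n")) # Split the string into lines
str_which(bts_string_line, "References") # Indices of the lines that contain "References"
bts_trunc <- str_c(bts_string_line[1:306], collapse=" ") # Keep lines 1 to 306 (the cutoff used for this PDF) and rejoin them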

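Next, we need to handle whitespace (\n, \r\n, or runs of multiple spaces).

Recall the POSIX character class that matches every whitespace character, including line breaks: [[:space:]]

bts_nospace <- str_replace_all(bts_trunc, "[[:space:]]{1,}", " ") # Replace one or more whitespace characters with a single space.
# Think about what {1,} does in this regular expression.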
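As a small illustration of the quantifier, the sketch below compares `{1,}`, `+`, and no quantifier on a made-up string (`messy` is hypothetical, not taken from the page).

```r
library(stringr)

# Hypothetical messy input with double spaces, blank lines, and a Windows line break
messy <- "bts  (band)\n\nis a\r\nboy   band"

# {1,} means "one or more", so each run of whitespace becomes one space
squashed_braces <- str_replace_all(messy, "[[:space:]]{1,}", " ")

# "+" is shorthand for {1,}
squashed_plus <- str_replace_all(messy, "[[:space:]]+", " ")

# Without a quantifier, each whitespace character is replaced one by one,
# so runs of whitespace stay as runs of spaces
one_by_one <- str_replace_all(messy, "[[:space:]]", " ")

squashed_braces
```

stringr also provides `str_squish()`, which collapses internal whitespace in the same way and additionally trims both ends of the string.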

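This should look better than before. What should we do with the string object next? Let's check whether the string contains any non-English characters. Using the POSIX character class [:ascii:], we need to remove every non-ASCII character. Then we need to standardize the remaining alphabetic characters (to either lowercase or uppercase), so we use the tolower function to convert the characters in the string to lowercase.

str_extract_all(bts_nospace, "[^[:ascii:]]{1,}") # Extract all non-ASCII characters (match the preceding character set at least once)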

## [[1]]
##  [1] "·"                     "·"                    
##  [3] "·"                     "·"                    
##  [5] "·"                     "·"                    
##  [7] "<U+2013>"               "·"                    
##  [9] "·"                     "·"                    
## [11] "彈少年團"               "<U+2013>"              
## [13] "<U+01D2>"               "<U+5F3E>少年<U+56E3>"  
## [15] "ぼうだんしょうねんだん" "<U+014D>"              
## [17] "<U+014D>"               "o"                     
## [19] "o"                      "<U+2013>"              
## [21] "<U+2013>"               "<U+2013>"              
## [23] "彈少年團"               "’"                    
## [25] "<U+014D>"               "<U+014D>"              
## [27] "<U+5F3E>少年<U+56E3>"   "<U+2013>"              
## [29] "<U+2013>"               "<U+2013>"              
## [31] "<U+2013>"               "<U+2013>"              
## [33] "”"                     "’"                    
## [35] "“"                     "”"                    
## [37] "’"                     "“"                    
## [39] "”"                     "<U+2013>"              
## [41] "“"                     "’"                    
## [43] "”"                     "\\"                    
## [45] "\\"                     "\\"                    
## [47] "\\"                     "\\"                    
## [49] "<U+2013>"               "<U+2013>"              
## [51] "<U+2013>"               "<U+2013>"              
## [53] "<U+2013>"               "<U+2013>"              
## [55] "<U+2013>"               "<U+2013>"              
## [57] "<U+2013>"               "<U+2013>"              
## [59] "<U+2013>"               "<U+2013>"              
## [61] "<U+2013>"               "<U+2013>"              
## [63] "<U+2013>"               "<U+2013>"              
## [65] "<U+2013>"               "<U+2013>"              
## [67] "<U+2013>"               "<U+2013>"              
## [69] "<U+2013>"               "<U+2013>"              
## [71] "<U+2013>"               "<U+2013>"              
## [73] "<U+2013>"

bts_eng <- str_replace_all(bts_nospace, "[^[:ascii:]]+", " ") # Replace non-ASCII characters with " ". Why replace a run of one or more non-ASCII characters with a single space instead of deleting it ("")?
bts_eng_lower <- tolower(bts_eng) # Translate all characters into lower-case letters
# If you get an error message, use the stringr function str_to_lower instead.

bts_eng_lower <- str_to_lower(bts_eng)
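One way to see why a space is the safer replacement: when a non-ASCII character such as an en dash sits between two words, deleting it fuses them. The string below is a made-up example, not text from the page.

```r
library(stringr)

# Hypothetical example: two words joined by an en dash (U+2013)
sample_txt <- "seoul\u2013busan"

# Deleting the non-ASCII run glues the neighbours together
fused <- str_replace_all(sample_txt, "[^[:ascii:]]+", "")

# Replacing it with a space keeps the word boundary
separated <- str_replace_all(sample_txt, "[^[:ascii:]]+", " ")

fused      # "seoulbusan"
separated  # "seoul busan"
```

Any extra spaces this introduces can be collapsed later with the same whitespace step used above.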

Now let's think about how to handle punctuation and digits.

# Check the punctuation to be removed.
unlist(str_extract_all(bts_eng_lower, "[[:punct:]]+"))[1:100] # Think about what '+' does in this regular expression.

##   [1] "("    ")"    "("    ":"    ";"    ":"    "),"   ","    "-"    "."   
##  [11] ","    "\""   "\""   "."    ","    "."    "&"    "("    "),"   ","   
##  [21] "("    ")"    ":"    "("    "),"   "."    "."    ".["   "]"    ":"   
##  [31] ","    ":"    ","    ","    ","    ","    ","    ".["   "]"    "-"   
##  [41] ","    "("    "),["  "]"    ","    "-"    ".["   "]"    ","    ".["  
##  [51] "]"    "."    ","    "'"    ","    "\""   "\",[" "]"    ".["   "]"   
##  [61] "'"    "-"    "["    "]"    "&"    ","    ":"    "("    "),["  "]"   
##  [71] ",["   "]"    "."    "["    "]"    "'"    ","    "\""   "\","  "["   
##  [81] "]"    ".["   "]"    "."    "."    ","    "\""   "\","  "."    ","   
##  [91] ".["   "]"    "-"    "."    "'"    ","    "'"    ","    "."    "."
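The effect of `+` shows up in entries such as `"),["` in the output above: consecutive punctuation marks are returned as one match. A minimal sketch on a short made-up snippet:

```r
library(stringr)

# Hypothetical snippet in the style of the page text
snippet <- "wings (2016),[6] peaked"

# Without a quantifier, each punctuation mark is its own match
one_each <- str_extract_all(snippet, "[[:punct:]]")[[1]]

# With "+", a run of adjacent punctuation marks is a single match
runs <- str_extract_all(snippet, "[[:punct:]]+")[[1]]

one_each  # "(" ")" "," "[" "]"
runs      # "(" "),[" "]"
```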

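It looks like every punctuation mark in the text has been extracted.

But what should we do about punctuation that is part of a word? What kinds are there? For example…

# Check the punctuation to be removed.
unlist(str_extract_all(bts_eng_lower, "[[:graph:]]*[[:punct:]]{1,}[[:graph:]]+"))[1:100] # "*" matches the preceding pattern zero or more times.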

##   [1] "(band)"               "(hangul:"             "sonyeondan),"        
##   [4] "seven-member"         "\"no"                 "(2014),"             
##   [7] "(2015)"               "(2016),"              "u.s."                
##  [10] "200.[4]"              "awards.[5]"           "j-hope"              
##  [13] "(2016),[6]"           "k-pop"                "ever.[7]"            
##  [16] "time.[8]"             "1.5"                  "\"million"           
##  [19] "seller\",[9]"         "awards.[10]"          "group's"             
##  [22] "k-pop"                "hop[1]"               "r&b"                 
##  [25] "(2017),[11]"          "200,[12]"             "japan[2]"            
##  [28] "album's"              "\"dna\","             "columbia[3]"         
##  [31] "67.[13]"              "bts.ibighit.com"      "\"mic"               
##  [34] "drop\","              "ever.[14]"            "j-hope"              
##  [37] "1.2"                  "korea's"              "chart's"             
##  [40] "g.o.d's"              "2001.[15]"            "2017.[16]"           
##  [43] "(2018),"              "200,[17]"             "k-pop"               
##  [46] "worldwide.[18][19]"   "2016.[20]"            "k-pop"               
##  [49] "pangt'an"             "may.[21]"             "chart.[22]"          
##  [52] "chart.[23]"           "youtube's"            "chart.[24]"          
##  [55] "abbma.[25][26]"       "kunrei-shiki"         "internet.[27]"       
##  [58] "\"having"             "world's"              "group\".[28]"        
##  [61] "\"liked"              "(502"                 "million)\""          
##  [64] "u.s."                 "combined.[29]"        "[30]"                
##  [67] "korea's"              "group's"              "(hangul:"            
##  [70] "),"                   "\"bulletproof"        "scouts\"."           
##  [73] "adolescents.[31][32]" "),"                   "similarly.[33]"      
##  [76] "\"beyond"             "identity.[34]"        "\"growing"           
##  [79] "forward.\"[35]"       "j-hope,"              "soundcloud.[36][37]" 
##  [82] "group's"              "\"school"             "\"no"                
##  [85] "2013.[38][39]"        "105,000"              "copies,[40][41]"     
##  [88] "\"no"                 "\"we"                 "hits.[42][43]"       
##  [91] "\"no"                 "re-recorded"          "2014.[44]"           
##  [94] "\"school"             "o!rul8,2?,"           "120,000"             
##  [97] "four.[45][46]"        "\"n.o\""              "\"attack"            
## [100] "(korean:"

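The word list above shows what to watch for when deleting punctuation. First, rather than removing only the apostrophe “’”, it is better to remove “’s” as a whole. Second, “u.s.” and “r&b” are better replaced with “usa” and “rnb”. Finally, "o!rul8,2?" is meaningful as it stands, but for convenience we ignore the role of its punctuation and digits here.

unlist(str_extract_all(bts_eng_lower, "[[:alpha:]]+['][sS] ")) # Before replacing a pattern, always check first that it matches what you intend.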

##  [1] "group's "       "album's "       "korea's "       "chart's "      
##  [5] "d's "           "youtube's "     "world's "       "korea's "      
##  [9] "group's "       "group's "       "mtv's "         "onbillboard's "
## [13] "infuse's "      "mtv's "         "billboard's "   "oricon's "     
## [17] "group's "       "group's "       "melon's "       "billboard's "  
## [21] "group's "       "clark's "       "year's "        "show's "       
## [25] "asahi's "       "group's "       "band's "        "group's "      
## [29] "group's "       "v's "           "hope's "        "suga's "       
## [33] "rm's "          "group's "       "group's "       "puma's "       
## [37] "allets's "      "let's "         "wonder's "      "school's "     
## [41] "bts's "         "bts's "
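Trying a replacement on a small string first makes its side effects visible. The sentence below is hypothetical; it suggests that the pattern `['][sS] ` strips possessives, also turns "let's" into "let", and leaves a bare trailing apostrophe (as in "bts'") untouched.

```r
library(stringr)

# Hypothetical test sentence covering three apostrophe cases
test_txt <- "the group's album and bts' fans let's go "

# Replace an apostrophe followed by s (or S) and a space with a single space
no_apos <- str_replace_all(test_txt, "['][sS] ", " ")

no_apos  # "the group album and bts' fans let go "
```

The leftover apostrophe in "bts'" is not a problem in this workflow, because the later step that removes all punctuation deletes it.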

bts_noapos <- str_replace_all(bts_eng_lower, "['][sS] ", " ") # Replace the pattern with a single space.
str_extract_all(bts_noapos, " u\\.s\\. | r\\&b ")

## [[1]]
## [1] " u.s. " " r&b "  " u.s. " " r&b "

bts_usa <- str_replace_all(bts_noapos,  " u\\.s\\. ", " usa ")
bts_rnb <- str_replace_all(bts_usa,  " r\\&b ", " rnb ")
str_extract_all(bts_rnb, " usa | rnb ")

## [[1]]
## [1] " usa " " rnb " " usa " " rnb "
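The backslashes in `u\\.s\\.` matter because an unescaped `.` is a metacharacter that matches any character. A short sketch on a hypothetical vector; `fixed()` is shown as an alternative way to request a literal match.

```r
library(stringr)

# Hypothetical candidates: only the first is the abbreviation we want
candidates <- c("u.s.", "uxsy", "r&b")

# Unescaped dots match any character, so "uxsy" matches too
loose <- str_detect(candidates, "u.s.")

# Escaped dots match only a literal "."
strict <- str_detect(candidates, "u\\.s\\.")

# fixed() treats the whole pattern as literal text
literal <- str_detect(candidates, fixed("u.s."))

# "&" is not a metacharacter, so escaping it is optional
amp_plain <- str_detect(candidates, "r&b")
amp_escaped <- str_detect(candidates, "r\\&b")

loose   # TRUE  TRUE FALSE
strict  # TRUE FALSE FALSE
```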

The Wikipedia page contains many citation-style markers of the form [digits] or (digits).

str_trunc(bts_rnb, 1000)

## [1] "bts (band) bts (hangul: ; rr: bangtan sonyeondan), also known as bts the bangtan boys, is a seven-member south korean boy band formed by big hit entertainment. they debuted on june 12, 2013 with the song \"no more dream\" from their first album 2 cool 4 skool. they won several new artist of the year awards for the track, including at the 2013 melon music awards and golden disc awards and the 2014 seoul music awards. the band continued to rise to widespread prominence with their subsequent albums dark & wild (2014), the most beautiful moment in life, part 2 (2015) and the most beautiful moment in life: young forever (2016), with the latter two entering the bts at the 32nd golden disk awards on usa billboard 200.[4] the most beautiful moment in life: young january 10, 2018 forever went on to win the album of the year award at the 2016 from left to right: v, suga, jin, jungkook, rm, melon music awards.[5] jimin and j-hope their second full album, wings (2016),[6] peaked at number 26 on t..."

Let's remove these markers.

str_extract_all(bts_rnb, "\\[\\d+\\]|\\(\\d+\\)")

## [[1]]
##   [1] "(2014)" "(2015)" "(2016)" "[4]"    "[5]"    "(2016)" "[6]"   
##   [8] "[7]"    "[8]"    "[9]"    "[10]"   "[1]"    "(2017)" "[11]"  
##  [15] "[12]"   "[2]"    "[3]"    "[13]"   "[14]"   "[15]"   "[16]"  
##  [22] "(2018)" "[17]"   "[18]"   "[19]"   "[20]"   "[21]"   "[22]"  
##  [29] "[23]"   "[24]"   "[25]"   "[26]"   "[27]"   "[28]"   "[29]"  
##  [36] "[30]"   "[31]"   "[32]"   "[33]"   "[34]"   "[35]"   "[36]"  
##  [43] "[37]"   "[38]"   "[39]"   "[40]"   "[41]"   "[42]"   "[43]"  
##  [50] "[44]"   "[45]"   "[46]"   "[47]"   "[48]"   "[49]"   "(2014)"
##  [57] "[50]"   "[51]"   "[52]"   "[53]"   "[54]"   "[55]"   "[56]"  
##  [64] "[57]"   "[58]"   "[59]"   "(2014)" "[60]"   "[61]"   "[62]"  
##  [71] "[63]"   "[64]"   "[65]"   "(2015)" "[66]"   "[67]"   "[68]"  
##  [78] "[69]"   "[70]"   "[71]"   "[71]"   "[72]"   "[73]"   "[74]"  
##  [85] "[63]"   "[75]"   "[76]"   "[77]"   "[78]"   "[79]"   "[80]"  
##  [92] "(2016)" "[81]"   "[82]"   "[83]"   "[84]"   "[5]"    "[85]"  
##  [99] "[86]"   "[87]"   "[88]"   "[89]"   "[90]"   "[91]"   "[10]"  
## [106] "[92]"   "[93]"   "[94]"   "[95]"   "[96]"   "[97]"   "[98]"  
## [113] "[99]"   "[100]"  "[101]"  "[102]"  "[103]"  "[104]"  "[105]" 
## [120] "[106]"  "[107]"  "[108]"  "[109]"  "[110]"  "[111]"  "[112]" 
## [127] "[113]"  "[114]"  "[115]"  "[116]"  "[117]"  "[118]"  "[119]" 
## [134] "[120]"  "[121]"  "[122]"  "[123]"  "[124]"  "[125]"  "[14]"  
## [141] "[126]"  "[127]"  "[128]"  "[129]"  "[130]"  "[131]"  "[132]" 
## [148] "[133]"  "[134]"  "[135]"  "[136]"  "[137]"  "[138]"  "[139]" 
## [155] "[140]"  "[141]"  "[142]"  "[143]"  "[144]"  "[145]"  "[146]" 
## [162] "[147]"  "[148]"  "[149]"  "[150]"  "[151]"  "[152]"  "[153]" 
## [169] "(2013)" "(2013)" "(2014)" "[154]"  "[155]"  "(2016)" "[156]" 
## [176] "[154]"  "[157]"  "[158]"  "[156]"  "[159]"  "[160]"  "[161]" 
## [183] "[25]"   "[26]"   "[162]"  "[29]"   "[163]"  "[164]"  "[165]" 
## [190] "[166]"  "[167]"  "[168]"  "[169]"  "[170]"  "[171]"  "[172]" 
## [197] "[173]"  "[174]"  "[175]"  "[176]"  "[177]"  "[178]"  "[179]" 
## [204] "[180]"  "[181]"  "[182]"  "[183]"  "[184]"  "[185]"  "[186]" 
## [211] "[187]"  "[188]"  "[189]"  "[190]"  "[191]"  "[192]"  "[193]" 
## [218] "[194]"  "[195]"  "[196]"  "[197]"  "[198]"  "[199]"  "[200]" 
## [225] "[201]"  "[202]"  "[203]"  "[204]"  "[205]"  "[206]"  "[207]" 
## [232] "[206]"  "[208]"  "[206]"  "[209]"  "[206]"  "[210]"  "[206]" 
## [239] "[211]"  "[206]"  "[212]"  "[206]"  "[213]"  "(2014)" "(2014)"
## [246] "(2016)" "(2016)" "(2018)" "(2018)" "(2014)" "(2015)" "(2015)"
## [253] "(2015)" "(2016)" "(2017)" "(2018)" "[214]"  "[215]"  "[216]"

bts_nocite <- str_replace_all(bts_rnb, "\\[[[:digit:]]+\\]|\\([[:digit:]]+\\)", "")
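The extraction uses `\\d` while the replacement uses `[[:digit:]]`; for this kind of input the two behave the same. The sketch below checks that on a made-up string and shows that brackets containing anything other than digits are left alone.

```r
library(stringr)

# Hypothetical text in the style of the page
cite <- "billboard 200.[4] wings (2016),[6] part 2 (remix)"

pat_d <- "\\[\\d+\\]|\\(\\d+\\)"                          # shorthand digit class
pat_posix <- "\\[[[:digit:]]+\\]|\\([[:digit:]]+\\)"      # POSIX digit class

found_d <- str_extract_all(cite, pat_d)[[1]]
found_posix <- str_extract_all(cite, pat_posix)[[1]]

# Remove the markers; "(remix)" has no digits and is kept
cleaned <- str_replace_all(cite, pat_posix, "")

found_d  # "[4]" "(2016)" "[6]"
cleaned  # "billboard 200. wings , part 2 (remix)"
```

Note that this pattern also deletes years in parentheses such as (2016), as the extraction output above shows. That is acceptable here because all digits are removed in a later step anyway.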

Now we are ready to remove all punctuation from the string.

unlist(str_extract_all(bts_nocite, "[[:graph:]]*[[:punct:]]{1,}[[:graph:]]*"))[1:100]

##   [1] "(band)"          "(hangul:"        ";"              
##   [4] "rr:"             "sonyeondan),"    "boys,"          
##   [7] "seven-member"    "entertainment."  "12,"            
##  [10] "\"no"            "dream\""         "skool."         
##  [13] "track,"          "awards."         "&"              
##  [16] ","               "life,"           "life:"          
##  [19] ","               "200."            "life:"          
##  [22] "10,"             "right:"          "v,"             
##  [25] "suga,"           "jin,"            "jungkook,"      
##  [28] "rm,"             "awards."         "j-hope"         
##  [31] "album,"          ","               "200,"           
##  [34] "k-pop"           "ever."           "korea,"         
##  [37] "time."           "1.5"             "copies,"        
##  [40] "bts'"            "seoul,"          "\"million"      
##  [43] "seller\","       "awards."         "k-pop"          
##  [46] "release,"        "yourself:"       ","              
##  [49] "200,"            "history."        "track,"         
##  [52] "\"dna\","        "67."             "bts.ibighit.com"
##  [55] "album,"          "\"mic"           "drop\","        
##  [58] "100."            "america,"        "ever."          
##  [61] "j-hope"          "1.2"             "month,"         
##  [64] "years,"          "g.o.d"           "2001."          
##  [67] "2017."           "yourself:"       ","              
##  [70] "200,"            "k-pop"           "far."           
##  [73] "debut,"          "worldwide."      "presence,"      
##  [76] "2016."           "that,"           "k-pop"          
##  [79] "pangt'an"        "may."            "2016,"          
##  [82] "chart,"          "chart."          "date,"          
##  [85] "chart."          "year,"           "100:"           
##  [88] "chart,"          "chart."          "2017,"          
##  [91] "awards,"         "abbma."          "2017,"          
##  [94] "kunrei-shiki"    "internet."       "20,"            
##  [97] "\"having"        "group\"."        "december,"      
## [100] "2017,"

bts_nopunct <- str_replace_all(bts_nocite, "[[:punct:]^]+", "")
str_trunc(bts_nopunct, 1000)

## [1] "bts band bts hangul  rr bangtan sonyeondan also known as bts the bangtan boys is a sevenmember south korean boy band formed by big hit entertainment they debuted on june 12 2013 with the song no more dream from their first album 2 cool 4 skool they won several new artist of the year awards for the track including at the 2013 melon music awards and golden disc awards and the 2014 seoul music awards the band continued to rise to widespread prominence with their subsequent albums dark  wild  the most beautiful moment in life part 2  and the most beautiful moment in life young forever  with the latter two entering the bts at the 32nd golden disk awards on usa billboard 200 the most beautiful moment in life young january 10 2018 forever went on to win the album of the year award at the 2016 from left to right v suga jin jungkook rm melon music awards jimin and jhope their second full album wings  peaked at number 26 on the background information billboard 200 which marked the highest cha..."

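Let's use the str_extract_all function to inspect the numeric expressions before removing them.

# Check the numbers to be removed.
# We do not want to strip the letters that follow digits to form a word (e.g., 2nd).
# So the digits may be followed by a space or by letters.
unlist(str_extract_all(bts_nopunct, " [[:digit:]]+[[:space:]]?[[:alpha:]]*"))[1:100] # What does "?" do in a regular expression?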

##   [1] " 12 "              " 2 cool"           " 4 skool"         
##   [4] " 2013 melon"       " 2014 seoul"       " 2 "              
##   [7] " 32nd"             " 200 the"          " 10 "             
##  [10] " 2016 from"        " 26 on"            " 200 which"       
##  [13] " 15 million"       " 2016 mnet"        " 2013 present"    
##  [16] " 200 marking"      " 100 canyon"       " 85 and"          
##  [19] " 67 another"       " 28 on"            " 100 both"        
##  [22] " 12 million"       " 16 jimin"         " 2001 bts"        
##  [25] " 2017 their"       " 200 making"       " 7 million"       
##  [28] " 2016 revised"     " 2016 billboard"   " 50 chart"        
##  [31] " 78 weeks"         " 50 chart"         " 100 "            
##  [34] " 6th"              " 14th"             " 2017 they"       
##  [37] " 2017 kunreishiki" " 25 most"          " 20 "             
##  [40] " 2018 edition"     " 2017 being"       " 502 million"     
##  [43] " 2018 a"           " 2010 "            " 2015 "           
##  [46] " 2017 present"     " 2017 bts"         " 2010 "           
##  [49] " 2010 and"         " 2011 by"          " 2012 six"        
##  [52] " 2 cool"           " 4 skool"          " 12 "             
##  [55] " 2014 performing"  " 105000 copies"    " 2 were"          
##  [58] " 4 "               " 11 "              " 120000 copies"   
##  [61] " 2013melon"        " 2014 seoul"       " 200000 copies"   
##  [64] " 200000 copies"    " 28000 copies"     " 2014 bts"        
##  [67] " 2015 "            " 2015 bts"         " 1 bts"           
##  [70] " 1 "               " 27 best"          " 2015 so"         
##  [73] " 2016 in"          " 2 "               " 44 on"           
##  [76] " 100 million"      " 1have"            " 300000 copies"   
##  [79] " 42000 copies"     " 2 later"          " 30 "             
##  [82] " 171 on"           " 200 chart"        " 5000 copies"     
##  [85] " 2015 mnet"        " 1 and"            " 2 were"          
##  [88] " 40 hit"           " 10 hit"           " 20 hit"          
##  [91] " 7 "               " 44000 copies"     " 500000 copies"   
##  [94] " 6 million"        " 24 hours"         " 24 hours"        
##  [97] " 2016 mnet"        " 2017 present"     " 2017 bts"        
## [100] " 700000 copies"
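The `?` makes the whitespace after the digits optional (zero or one occurrence), which is why both " 32nd" and " 2 cool" appear in the output above. A minimal sketch on a hypothetical string:

```r
library(stringr)

# Hypothetical text after punctuation removal
nums <- "june 12 2013 the 32nd awards 2 cool"

# With "?": an optional space, then optional letters, may follow the digits
with_q <- str_extract_all(nums, " [[:digit:]]+[[:space:]]?[[:alpha:]]*")[[1]]

# Without the optional space: only letters directly attached to the digits are included
no_q <- str_extract_all(nums, " [[:digit:]]+[[:alpha:]]*")[[1]]

with_q  # " 12 "  " 32nd"  " 2 cool"
no_q    # " 12"  " 2013"  " 32nd"  " 2"
```

In the first result "2013" is missing: the match " 12 " already consumed the space in front of it, and matches cannot overlap. The same thing happens in the output above, so that listing is useful for inspection but is not a complete inventory of the numbers.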

bts_nonum <- str_replace_all(bts_nopunct, "[[:digit:]]+", "") # Remove all digits
str_trunc(bts_nonum, 1000)

## [1] "bts band bts hangul  rr bangtan sonyeondan also known as bts the bangtan boys is a sevenmember south korean boy band formed by big hit entertainment they debuted on june   with the song no more dream from their first album  cool  skool they won several new artist of the year awards for the track including at the  melon music awards and golden disc awards and the  seoul music awards the band continued to rise to widespread prominence with their subsequent albums dark  wild  the most beautiful moment in life part   and the most beautiful moment in life young forever  with the latter two entering the bts at the nd golden disk awards on usa billboard  the most beautiful moment in life young january   forever went on to win the album of the year award at the  from left to right v suga jin jungkook rm melon music awards jimin and jhope their second full album wings  peaked at number  on the background information billboard  which marked the highest chart ranking for a kpop also known as b..."

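We have now preprocessed the text by removing non-English characters, all digits, and punctuation. However, we can still see runs of multiple spaces that were created during preprocessing.

bts_nospace <- str_replace_all(bts_nonum, "[[:space:]]{1,}", " ") # The whitespace-collapsing step can be repeated.

Finally, the string object bts_nospace is ready to be tokenized into words separated by " ".

bts_tidy_word <- unlist(str_split(bts_nospace, " "))
bts_tidy_word_freq <- sort(table(bts_tidy_word), decreasing = TRUE) # Count how often each word occurs and sort in decreasing order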
bts_tidy_word_freq[1:50]

## bts_tidy_word
##       the       and        in        on        to       bts        of 
##       305       108       106        98        73        72        65 
##       for     their        at     first         a      that        as 
##        63        61        50        46        45        42        39 
##     group     album     chart     music    number      they      with 
##        39        38        34        33        33        32        30 
##    korean       was    awards        by      also billboard    artist 
##        28        28        27        24        23        23        22 
##       top      year      love      over  released    single        an 
##        21        21        20        20        20        19        17 
##      most      were       its     korea      kpop     world  yourself 
##        17        17        16        16        16        16        16 
##   episode     later      song    copies      from    albums  japanese 
##        15        15        15        14        14        13        13 
##       new 
##        13
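The split-and-count step can be tried on a small string. The sketch below uses a hypothetical string `tidy` and also shows one pitfall: a leading or trailing space produces empty-string tokens, which `str_squish()` avoids by trimming both ends before the split.

```r
library(stringr)

# Hypothetical cleaned text with stray spaces at both ends and a double space inside
tidy <- " bts is a boy band  bts won "

# Splitting as-is yields empty tokens ("")
raw_tokens <- str_split(tidy, " ")[[1]]

# Collapse internal whitespace and trim the ends, then split
tokens <- str_split(str_squish(tidy), " ")[[1]]

# Count each token and sort by decreasing frequency
freq <- sort(table(tokens), decreasing = TRUE)
freq
```

Whether the real `bts_nospace` starts or ends with a space depends on the PDF, so this trimming is a precaution rather than a required fix.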

Let's build a word cloud from the word frequency table.

library(wordcloud)

## Loading required package: RColorBrewer

pal <- brewer.pal(8, "Dark2") # Retrieve 8 colors from the "Dark2" palette
set.seed(405)
wordcloud(words = names(bts_tidy_word_freq), # Vector of unique words
          freq = bts_tidy_word_freq, # Word frequencies
          min.freq = 5, # Minimum frequency for a word to be shown
          max.words = 500, # Show at most 500 words, in order of frequency
          random.order = FALSE, # Place the most frequent words at the center
          rot.per = 0.1, # Proportion of words that are rotated in the plot
          scale = c(4, 0.3), # Range of word sizes
          colors = pal) # Word colors

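Handling stopwords

Stopwords are words that serve a grammatical function rather than carrying meaning in the language. They appear frequently regardless of what the text is about.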

install.packages("tm")

library(tm)

## Loading required package: NLP

# "en" dictionary
stopwords("en")[1:10]

##  [1] "i"         "me"        "my"        "myself"    "we"       
##  [6] "our"       "ours"      "ourselves" "you"       "your"

length(stopwords("en"))

## [1] 174

# "SMART" dictionary
stopwords("SMART")[1:10]

##  [1] "a"           "a's"         "able"        "about"       "above"      
##  [6] "according"   "accordingly" "across"      "actually"    "after"

length(stopwords("SMART"))

## [1] 571

Let’s use the “SMART” stopwords dictionary

bts_words_nostop <- bts_tidy_word[!bts_tidy_word %in% stopwords("SMART")]
sort(table(bts_words_nostop), decreasing=T)[1:50]

## bts_words_nostop
##       bts     group     album     chart     music    number    korean 
##        72        39        38        34        33        33        28 
##    awards billboard    artist       top      year      love  released 
##        27        23        22        21        21        20        20 
##    single     korea      kpop     world   episode      song    copies 
##        19        16        16        16        15        15        14 
##    albums  japanese    social   bangtan      life   million    peaked 
##        13        13        13        12        12        12        12 
##     south      gaon      live      tour beautiful     debut       hit 
##        12        11        11        11        10        10        10 
##      june    making    moment      part     seoul   trilogy       boy 
##        10        10        10        10        10        10         9 
##     japan   members      show     wings       won  campaign  december 
##         9         9         9         9         9         8         8 
##    school 
##         8
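The filtering expression `x[!x %in% stopwords]` can be sketched in base R with a tiny stopword vector. Both vectors below are hypothetical and stand in for `bts_tidy_word` and `stopwords("SMART")`.

```r
# Hypothetical tokens and a hypothetical three-word stopword list
tokens <- c("the", "bts", "and", "the", "album", "of", "bts")
my_stopwords <- c("the", "and", "of")

# %in% binds more tightly than !, so this reads as !(tokens %in% my_stopwords)
kept <- tokens[!tokens %in% my_stopwords]

kept  # "bts" "album" "bts"
```

Because `%in%` compares whole strings, the tokens must already be in the same case as the dictionary entries, which is one reason the text was converted to lowercase earlier.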

Word cloud without stopwords

bts_words_nostop_freq <- sort(table(bts_words_nostop), decreasing = TRUE)
wordcloud(words = names(bts_words_nostop_freq),
          freq = bts_words_nostop_freq,
          min.freq = 5,
          max.words = 1000,
          random.order = FALSE,
          rot.per = 0.1,
          scale = c(4, 0.3),
          colors = pal) # Word colors

KMOOC_Week8

Shin Lee

8/8/2018
