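To get anywhere with text mining, you have to start with word segmentation. What is word segmentation? It means splitting a text into its individual words, which shows us what vocabulary the text is built from and lets us go on to find its high-frequency or key terms.
Text mining techniques were developed first for English, so packages for English text, such as tm and tidytext, are already quite mature. For Chinese, the relevant packages include jiebaR (結巴), Rwordseg, and tmcn. As of May 2019, Rwordseg has not been published on CRAN and must be installed from R-Forge. Some segmentation packages (Rwordseg in particular) also depend on rJava, so install it first. The packages can be installed as follows: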
install.packages("rJava")
install.packages("jiebaR")
install.packages("tmcn")
# Install a package from R-Forge
install.packages("Rwordseg", repos = "http://R-Forge.R-project.org")
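If you rerun this setup often, it can help to install only what is missing. The following is a minimal sketch and not part of the original workflow: `missing_packages()` is a hypothetical helper name, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

cran_pkgs <- c("rJava", "jiebaR", "tmcn")
to_install <- missing_packages(cran_pkgs)
print(to_install)

# Uncomment to actually install the missing CRAN packages:
# if (length(to_install) > 0) install.packages(to_install)
```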
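Rwordseg and jiebaR offer roughly the same functionality, but jiebaR is still being updated, so I will use jiebaR for the rest of this post. For usage details, refer to each package's own documentation. First, load the packages we need.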
library(dplyr)
library(lubridate)
library(stringr)
library(jiebaR)
library(wordcloud) # static word cloud
library(wordcloud2) # interactive word cloud
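Text mining inevitably requires being comfortable with string processing, but before chasing perfection, let's first see how jiebaR segments text for us. The example is a news article I scraped from MoneyDJ earlier:
# Sample article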
content <- "紐約商業交易所(NYMEX)6月原油期貨5月6日收盤上漲0.31美元或0.5%成為每桶62.25美元,因伊朗的局勢升溫,歐洲ICE期貨交易所(ICE Futures Europe)近月布蘭特原油上漲0.39美元或0.6%成為每桶71.24美元。路透社報導,美國正在向中東部署一個航母打擊群和一個轟炸機特遣部隊,美國代理國防部長稱伊朗政權的威脅是可信的。 卡達半島電視台網站5月5日報導,美國本月起取消對8個經濟體(中國、印度、日本、韓國、台灣、土耳其、義大利和希臘)購買伊朗石油的豁免,相比去年11月美國對伊朗石油出口實施制裁的時候允許這些國家在6個月內繼續購買以避免過度影響油價,顯然美國認為如今油市已經有足夠的供應。美國國務卿蓬佩奧(Mike Pompeo)表示,美國已經與主要產油國家進行溝通,希望確保油市的供應充足;加上美國國內的產油也在持續增長,這令美國有信心油市的供應不會匱乏。 不過,實際局勢可能未必如美國所想。目前有多個產油國家內政動盪並影響產量,包括阿爾及利亞、安哥拉、利比亞、伊朗、奈及利亞與委內瑞拉,一旦動盪升級,隨時會進一步影響油市供應。此外,伊朗重質原油也並非任何國家都能替代,遑論美國的輕質原油,與伊朗原油在品質上最為相近的是沙烏地阿拉伯,其次為阿拉伯聯合大公國。而如果油價因為任何供應問題再度飆升至每桶100美元,估計將令全球經濟增長削減0.6個百分點,通膨則將上揚0.7個百分點。 油田服務公司貝克休斯(Baker Hughes Inc.)公佈,截至5月3日,美國石油與天然氣探勘井數量較前週減少1座至990座,創下去年3月(990座)以來的13個月新低。其中,主要用於頁岩油氣開採的水平探勘井數量較前週持平為873座。探勘活動的增減會反映未來的石油產量,貝克休斯統計的探勘井是指為開發以及探勘新油氣儲藏所設的鑽井(鑽機)數量。 貝克休斯的數據顯示,截至5月3日,美國石油探勘井數量較前週所創的逾一年新低增加2座至807座,累計今年來仍減少78座;天然氣探勘井數量較前週減少3座至183座。與去年同期相比,美國石油探勘井數量減少27座,天然氣探勘井數量減少13座;水平探勘井數量年減2座。根據美國能源部週度預測數據,截至4月26日當週,美國原油日均產量再創新高水平至1,230萬桶。 在美國最大產油州德州,油氣探勘井數量較前週減少7座至484座,緊鄰德州上方的奧克拉荷馬州油氣探勘井數量較前週增加1座至103座,新墨西哥州油氣探勘井數量較前週增加2座至106座,路易斯安那州油氣探勘井數量較前週持平為62座,北達科他州油氣探勘井數量較前週減少1座至57座。最大頁岩油產地、盤據西德州與新墨西哥州東南部的二疊紀盆地石油探勘井數量較前週減少1座至459座。"
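# Define the segmenter (a jiebaR worker)
cutter <- worker(bylines = FALSE)
# Segment the text with the worker (there are two equivalent ways to write this)
# segment(content, cutter)
cutter[content]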
## [1] "紐約" "商業" "交易所" "NYMEX"
## [5] "6" "月" "原油期貨" "5"
## [9] "月" "6" "日" "收盤"
## [13] "上漲" "0.31" "美元" "或"
## [17] "0.5" "成為" "每桶" "62.25"
## [21] "美元" "因" "伊朗" "的"
## [25] "局勢" "升溫" "歐洲" "ICE"
## [29] "期貨" "交易所" "ICE" "Futures"
## [33] "Europe" "近" "月" "布蘭特"
## [37] "原油" "上漲" "0.39" "美元"
## [41] "或" "0.6" "成為" "每桶"
## [45] "71.24" "美元" "路透社" "報導"
## [49] "美國" "正在" "向" "中東"
## [53] "部署" "一個" "航母" "打擊"
## [57] "群" "和" "一個" "轟炸機"
## [61] "特遣部隊" "美國" "代理" "國防部長"
## [65] "稱" "伊朗" "政權" "的"
## [69] "威脅" "是" "可信" "的"
## [73] "卡達" "半島" "電視" "台"
## [77] "網站" "5" "月" "5"
## [81] "日" "報導" "美國" "本月"
## [85] "起" "取消" "對" "8"
## [89] "個" "經濟體" "中國" "印度"
## [93] "日本" "韓國" "台" "灣"
## [97] "土耳其" "義大利" "和" "希臘"
## [101] "購買" "伊朗" "石油" "的"
## [105] "豁免" "相比" "去年" "11"
## [109] "月" "美國" "對" "伊朗"
## [113] "石油" "出口" "實施" "制裁"
## [117] "的" "時候" "允許" "這些"
## [121] "國家" "在" "6" "個"
## [125] "月" "內" "繼續" "購買"
## [129] "以" "避免" "過度" "影響"
## [133] "油價" "顯然" "美國" "認為"
## [137] "如今" "油市" "已經" "有"
## [141] "足夠" "的" "供應" "美國"
## [145] "國務卿" "蓬佩奧" "Mike" "Pompeo"
## [149] "表示" "美國" "已經" "與"
## [153] "主要" "產油國" "家" "進行"
## [157] "溝通" "希望" "確保" "油市"
## [161] "的" "供應" "充足" "加上"
## [165] "美國" "國內" "的" "產油"
## [169] "也" "在" "持續增長" "這令"
## [173] "美國" "有" "信心" "油市"
## [177] "的" "供應" "不會" "匱乏"
## [181] "不過" "實際" "局勢" "可能"
## [185] "未必" "如" "美國" "所想"
## [189] "目前" "有" "多個" "產油國"
## [193] "家" "內政" "動盪" "並"
## [197] "影響" "產量" "包括" "阿爾及利亞"
## [201] "安哥拉" "利比亞" "伊朗" "奈及利亞"
## [205] "與" "委內瑞拉" "一旦" "動盪"
## [209] "升級" "隨時" "會" "進一步"
## [213] "影響" "油市" "供應" "此外"
## [217] "伊朗" "重質原油" "也" "並非"
## [221] "任何" "國家" "都" "能"
## [225] "替代" "遑論" "美國" "的"
## [229] "輕質" "原油" "與" "伊朗"
## [233] "原油" "在" "品質" "上"
## [237] "最" "為" "相近" "的"
## [241] "是" "沙烏地阿" "拉伯" "其次"
## [245] "為" "阿拉伯" "聯合" "大公國"
## [249] "而" "如果" "油價" "因為"
## [253] "任何" "供應" "問題" "再度"
## [257] "飆升" "至" "每桶" "100"
## [261] "美元" "估計" "將令" "全球"
## [265] "經濟" "增長" "削減" "0.6"
## [269] "個" "百分點" "通膨則將" "上揚"
## [273] "0.7" "個" "百分點" "油田"
## [277] "服務公司" "貝克" "休斯" "Baker"
## [281] "Hughes" "Inc" "公佈" "截至"
## [285] "5" "月" "3" "日"
## [289] "美國" "石油" "與" "天然氣"
## [293] "探勘" "井" "數量" "較前"
## [297] "週" "減少" "1" "座"
## [301] "至" "990" "座" "創下"
## [305] "去年" "3" "月" "990"
## [309] "座" "以來" "的" "13"
## [313] "個" "月" "新低" "其中"
## [317] "主要" "用於" "頁岩" "油氣"
## [321] "開採" "的" "水平" "探勘"
## [325] "井" "數量" "較前" "週"
## [329] "持平" "為" "873" "座"
## [333] "探勘" "活動" "的" "增減"
## [337] "會" "反映" "未來" "的"
## [341] "石油" "產量" "貝克" "休斯"
## [345] "統計" "的" "探勘" "井是"
## [349] "指為" "開發" "以及" "探勘"
## [353] "新" "油氣" "儲藏所" "設"
## [357] "的" "鑽井" "鑽機" "數量"
## [361] "貝克" "休斯" "的" "數據"
## [365] "顯示" "截至" "5" "月"
## [369] "3" "日" "美國" "石油"
## [373] "探勘" "井" "數量" "較前"
## [377] "週" "所創" "的" "逾"
## [381] "一年" "新低" "增加" "2"
## [385] "座" "至" "807" "座"
## [389] "累計" "今年" "來" "仍"
## [393] "減少" "78" "座" "天然氣"
## [397] "探勘" "井" "數量" "較前"
## [401] "週" "減少" "3" "座"
## [405] "至" "183" "座" "與"
## [409] "去年" "同期相比" "美國" "石油"
## [413] "探勘" "井" "數量" "減少"
## [417] "27" "座" "天然氣" "探勘"
## [421] "井" "數量" "減少" "13"
## [425] "座" "水平" "探勘" "井"
## [429] "數量" "年減" "2" "座"
## [433] "根據" "美國能源部" "週度" "預測"
## [437] "數據" "截至" "4" "月"
## [441] "26" "日" "當週" "美國"
## [445] "原油" "日均" "產量" "再創新高"
## [449] "水平" "至" "1" "230"
## [453] "萬桶" "在" "美國" "最大"
## [457] "產油" "州" "德州" "油氣"
## [461] "探勘" "井" "數量" "較前"
## [465] "週" "減少" "7" "座"
## [469] "至" "484" "座" "緊鄰"
## [473] "德州" "上方" "的" "奧克拉荷"
## [477] "馬" "州" "油氣" "探勘"
## [481] "井" "數量" "較前" "週"
## [485] "增加" "1" "座" "至"
## [489] "103" "座" "新墨西哥州" "油氣"
## [493] "探勘" "井" "數量" "較前"
## [497] "週" "增加" "2" "座"
## [501] "至" "106" "座" "路易斯安那州"
## [505] "油氣" "探勘" "井" "數量"
## [509] "較前" "週" "持平" "為"
## [513] "62" "座" "北達科他州" "油氣"
## [517] "探勘" "井" "數量" "較前"
## [521] "週" "減少" "1" "座"
## [525] "至" "57" "座" "最大"
## [529] "頁岩" "油" "產地" "盤據"
## [533] "西德" "州" "與" "新墨西哥州"
## [537] "東南部" "的" "二疊紀" "盆地"
## [541] "石油" "探勘" "井" "數量"
## [545] "較前" "週" "減少" "1"
## [549] "座" "至" "459" "座"
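As mentioned above, a text is split according to words, so what counts as a word has to be defined up front. That predefined word list or dictionary is what this post calls the corpus (strictly speaking, it is a segmentation dictionary, since a corpus is a collection of texts). jiebaR ships with a built-in dictionary, so it can segment text without any extra setup, but the result may not match our expectations. What if the default dictionary does not contain the keywords we care about? In that case we can add custom words, or even build a custom dictionary.
In the article above, for example, 紐約商業交易所 (New York Mercantile Exchange), 探勘井 (rig), and 頁岩油 (shale oil) were split apart instead of being treated as single words, so we add these terms to the dictionary.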
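# Note: the last entry is spelled 輕值原油, while the article writes 輕質原油,
# so it never matches and 輕質 / 原油 stay split in the output below.
new_words <- c("紐約商業交易所","探勘井","頁岩油","輕值原油")
# Add the words one at a time with a loop
for (i in seq_along(new_words)) {
  new_user_word(cutter, new_words[i])
}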
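To see why a dictionary entry changes the result, here is a toy dictionary-based segmenter in base R. It uses forward maximum matching, which is deliberately simple and is not the algorithm jiebaR uses; `fmm_segment()` and `base_dict` are hypothetical names made up for this sketch.

```r
# Toy forward maximum matching (NOT the algorithm jiebaR uses):
# at each position take the longest dictionary entry that matches,
# otherwise emit a single character.
fmm_segment <- function(text, dict) {
  stopifnot(length(dict) > 0)
  chars <- strsplit(text, "")[[1]]
  n <- length(chars)
  max_len <- max(lengths(strsplit(dict, "")))  # longest entry, in characters
  tokens <- character(0)
  i <- 1
  while (i <= n) {
    step <- 1                                  # fallback: one character
    for (len in seq(min(max_len, n - i + 1), 1)) {
      candidate <- paste(chars[i:(i + len - 1)], collapse = "")
      if (candidate %in% dict) {
        step <- len
        break
      }
    }
    tokens <- c(tokens, paste(chars[i:(i + step - 1)], collapse = ""))
    i <- i + step
  }
  tokens
}

base_dict <- c("探勘", "數量", "減少")
# Without the new word, 探勘井 falls apart into 探勘 + 井.
print(fmm_segment("探勘井數量減少", base_dict))
# With 探勘井 in the dictionary, it is kept as one token.
print(fmm_segment("探勘井數量減少", c(base_dict, "探勘井")))
```

The point is only that a segmenter can return a multi-character term as one token when that term is in its dictionary, which is what adding the new words above achieves.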
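We can add custom words freely, but for finance, isn't there a ready-made financial dictionary that can be plugged straight into the package? There is: the open-source dictionaries available online cover a wide range, including specialized ones for medicine, the social sciences, and other fields. The tmcn package also bundles National Taiwan University's sentiment dictionary (NTUSD), so there is no need to download it separately.
Besides proper nouns being split apart, the article also contains a lot of distracting numbers and even English words, which need to be handled first. The approach below uses a regular expression. If you are curious, it is worth looking up: regular expressions are very handy for matching text that follows a particular pattern, and the technique carries over to most other programming languages.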
# Remove every run of ASCII digits and letters
content <- str_remove_all(content, "[0-9a-zA-Z]+")
cutter[content]
## [1] "紐約商業交易所" "月" "原油期貨" "月"
## [5] "日" "收盤" "上漲" "美元"
## [9] "或" "成為" "每桶" "美元"
## [13] "因" "伊朗" "的" "局勢"
## [17] "升溫" "歐洲" "期貨" "交易所"
## [21] "近" "月" "布蘭特" "原油"
## [25] "上漲" "美元" "或" "成為"
## [29] "每桶" "美元" "路透社" "報導"
## [33] "美國" "正在" "向" "中東"
## [37] "部署" "一個" "航母" "打擊"
## [41] "群" "和" "一個" "轟炸機"
## [45] "特遣部隊" "美國" "代理" "國防部長"
## [49] "稱" "伊朗" "政權" "的"
## [53] "威脅" "是" "可信" "的"
## [57] "卡達" "半島" "電視" "台"
## [61] "網站" "月" "日" "報導"
## [65] "美國" "本月" "起" "取消"
## [69] "對個" "經濟體" "中國" "印度"
## [73] "日本" "韓國" "台" "灣"
## [77] "土耳其" "義大利" "和" "希臘"
## [81] "購買" "伊朗" "石油" "的"
## [85] "豁免" "相比" "去年" "月"
## [89] "美國" "對" "伊朗" "石油"
## [93] "出口" "實施" "制裁" "的"
## [97] "時候" "允許" "這些" "國家"
## [101] "在" "個" "月" "內"
## [105] "繼續" "購買" "以" "避免"
## [109] "過度" "影響" "油價" "顯然"
## [113] "美國" "認為" "如今" "油市"
## [117] "已經" "有" "足夠" "的"
## [121] "供應" "美國" "國務卿" "蓬佩奧"
## [125] "表示" "美國" "已經" "與"
## [129] "主要" "產油國" "家" "進行"
## [133] "溝通" "希望" "確保" "油市"
## [137] "的" "供應" "充足" "加上"
## [141] "美國" "國內" "的" "產油"
## [145] "也" "在" "持續增長" "這令"
## [149] "美國" "有" "信心" "油市"
## [153] "的" "供應" "不會" "匱乏"
## [157] "不過" "實際" "局勢" "可能"
## [161] "未必" "如" "美國" "所想"
## [165] "目前" "有" "多個" "產油國"
## [169] "家" "內政" "動盪" "並"
## [173] "影響" "產量" "包括" "阿爾及利亞"
## [177] "安哥拉" "利比亞" "伊朗" "奈及利亞"
## [181] "與" "委內瑞拉" "一旦" "動盪"
## [185] "升級" "隨時" "會" "進一步"
## [189] "影響" "油市" "供應" "此外"
## [193] "伊朗" "重質原油" "也" "並非"
## [197] "任何" "國家" "都" "能"
## [201] "替代" "遑論" "美國" "的"
## [205] "輕質" "原油" "與" "伊朗"
## [209] "原油" "在" "品質" "上"
## [213] "最" "為" "相近" "的"
## [217] "是" "沙烏地阿" "拉伯" "其次"
## [221] "為" "阿拉伯" "聯合" "大公國"
## [225] "而" "如果" "油價" "因為"
## [229] "任何" "供應" "問題" "再度"
## [233] "飆升" "至" "每桶" "美元"
## [237] "估計" "將令" "全球" "經濟"
## [241] "增長" "削減" "個" "百分點"
## [245] "通膨則將" "上揚" "個" "百分點"
## [249] "油田" "服務公司" "貝克" "休斯"
## [253] "公佈" "截至" "月" "日"
## [257] "美國" "石油" "與" "天然氣"
## [261] "探勘井" "數量" "較前" "週"
## [265] "減少" "座" "至" "座"
## [269] "創下" "去年" "月" "座"
## [273] "以來" "的" "個" "月"
## [277] "新低" "其中" "主要" "用於"
## [281] "頁岩油" "氣" "開採" "的"
## [285] "水平" "探勘井" "數量" "較前"
## [289] "週" "持平" "為座" "探勘"
## [293] "活動" "的" "增減" "會"
## [297] "反映" "未來" "的" "石油"
## [301] "產量" "貝克" "休斯" "統計"
## [305] "的" "探勘井" "是" "指為"
## [309] "開發" "以及" "探勘" "新"
## [313] "油氣" "儲藏所" "設" "的"
## [317] "鑽井" "鑽機" "數量" "貝克"
## [321] "休斯" "的" "數據" "顯示"
## [325] "截至" "月" "日" "美國"
## [329] "石油" "探勘井" "數量" "較前"
## [333] "週" "所創" "的" "逾"
## [337] "一年" "新低" "增加" "座"
## [341] "至" "座" "累計" "今年"
## [345] "來" "仍" "減少" "座"
## [349] "天然氣" "探勘井" "數量" "較前"
## [353] "週" "減少" "座" "至"
## [357] "座" "與" "去年" "同期相比"
## [361] "美國" "石油" "探勘井" "數量"
## [365] "減少" "座" "天然氣" "探勘井"
## [369] "數量" "減少" "座" "水平"
## [373] "探勘井" "數量" "年" "減座"
## [377] "根據" "美國能源部" "週度" "預測"
## [381] "數據" "截至" "月" "日"
## [385] "當週" "美國" "原油" "日均"
## [389] "產量" "再創新高" "水平" "至"
## [393] "萬桶" "在" "美國" "最大"
## [397] "產油" "州" "德州" "油氣"
## [401] "探勘井" "數量" "較前" "週"
## [405] "減少" "座" "至" "座"
## [409] "緊鄰" "德州" "上方" "的"
## [413] "奧克拉荷" "馬" "州" "油氣"
## [417] "探勘井" "數量" "較前" "週"
## [421] "增加" "座" "至" "座"
## [425] "新墨西哥州" "油氣" "探勘井" "數量"
## [429] "較前" "週" "增加" "座"
## [433] "至" "座" "路易斯安那州" "油氣"
## [437] "探勘井" "數量" "較前" "週"
## [441] "持平" "為座" "北達科他州" "油氣"
## [445] "探勘井" "數量" "較前" "週"
## [449] "減少" "座" "至" "座"
## [453] "最大" "頁岩油" "產地" "盤據"
## [457] "西德" "州" "與" "新墨西哥州"
## [461] "東南部" "的" "二疊紀" "盆地"
## [465] "石油" "探勘井" "數量" "較前"
## [469] "週" "減少" "座" "至"
## [473] "座"
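One side effect is visible in the output above: deleting digits outright glues their neighbours together, which is where tokens such as 對個, 為座, and 減座 come from. The sketch below reproduces the effect with base R `gsub()` and shows one possible alternative, replacing each number or Latin word with a space so the boundary survives. The second pattern is my own assumption, not something from the original workflow, and I have not verified how jiebaR segments the spaced text.

```r
sample_text <- "美國本月起取消對8個經濟體的豁免,原油上漲0.31美元或0.5%"

# Deleting digits and letters outright fuses the characters around them.
removed <- gsub("[0-9a-zA-Z]+", "", sample_text)
print(removed)   # "美國本月起取消對個經濟體的豁免,原油上漲.美元或.%"

# Assumed alternative: match whole numbers (including decimals and a trailing %)
# or Latin words, and replace each match with a single space.
spaced <- gsub("[0-9]+(\\.[0-9]+)?%?|[A-Za-z]+", " ", sample_text)
print(spaced)    # "美國本月起取消對 個經濟體的豁免,原油上漲 美元或 "
```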
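Filler words such as 的, 然後, 於, 是, and 在 are another problem in segmentation. These function words contribute very little to analyzing the text, yet they occur very frequently. jiebaR can handle them too.
In English text mining, filtering stop words is an important step in its own right, and Chinese is no exception. We can load stop words from an external text file (.txt), and new words can be loaded the same way. As for which words count as stop words, you can define your own list and save it as a .txt file for the segmenter to read, or download a ready-made stop word list from the internet, in the same way as downloading an external dictionary.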
# Export the new words to a file (same list as before)
new_words <- c("紐約商業交易所","探勘井","頁岩油","輕值原油")
writeLines(new_words, "new_words.txt")
# Define the stop words and export them to a file
stop_words <- c("在","的","下","個","來","至","座","亦","與","或","日","月","年","週")
writeLines(stop_words, "stop_words.txt")
# Redefine the segmenter, loading the user dictionary and the stop words
cutter <- worker(user = "new_words.txt", stop_word = "stop_words.txt", bylines = FALSE)
seg_words <- cutter[content]
seg_words
## [1] "紐約商業交易所" "原油期貨" "收盤" "上漲"
## [5] "美元" "成為" "每桶" "美元"
## [9] "因" "伊朗" "局勢" "升溫"
## [13] "歐洲" "期貨" "交易所" "近"
## [17] "布蘭特" "原油" "上漲" "美元"
## [21] "成為" "每桶" "美元" "路透社"
## [25] "報導" "美國" "正在" "向"
## [29] "中東" "部署" "一個" "航母"
## [33] "打擊" "群" "和" "一個"
## [37] "轟炸機" "特遣部隊" "美國" "代理"
## [41] "國防部長" "稱" "伊朗" "政權"
## [45] "威脅" "是" "可信" "卡達"
## [49] "半島" "電視" "台" "網站"
## [53] "報導" "美國" "本月" "起"
## [57] "取消" "對個" "經濟體" "中國"
## [61] "印度" "日本" "韓國" "台"
## [65] "灣" "土耳其" "義大利" "和"
## [69] "希臘" "購買" "伊朗" "石油"
## [73] "豁免" "相比" "去年" "美國"
## [77] "對" "伊朗" "石油" "出口"
## [81] "實施" "制裁" "時候" "允許"
## [85] "這些" "國家" "內" "繼續"
## [89] "購買" "以" "避免" "過度"
## [93] "影響" "油價" "顯然" "美國"
## [97] "認為" "如今" "油市" "已經"
## [101] "有" "足夠" "供應" "美國"
## [105] "國務卿" "蓬佩奧" "表示" "美國"
## [109] "已經" "主要" "產油國" "家"
## [113] "進行" "溝通" "希望" "確保"
## [117] "油市" "供應" "充足" "加上"
## [121] "美國" "國內" "產油" "也"
## [125] "持續增長" "這令" "美國" "有"
## [129] "信心" "油市" "供應" "不會"
## [133] "匱乏" "不過" "實際" "局勢"
## [137] "可能" "未必" "如" "美國"
## [141] "所想" "目前" "有" "多個"
## [145] "產油國" "家" "內政" "動盪"
## [149] "並" "影響" "產量" "包括"
## [153] "阿爾及利亞" "安哥拉" "利比亞" "伊朗"
## [157] "奈及利亞" "委內瑞拉" "一旦" "動盪"
## [161] "升級" "隨時" "會" "進一步"
## [165] "影響" "油市" "供應" "此外"
## [169] "伊朗" "重質原油" "也" "並非"
## [173] "任何" "國家" "都" "能"
## [177] "替代" "遑論" "美國" "輕質"
## [181] "原油" "伊朗" "原油" "品質"
## [185] "上" "最" "為" "相近"
## [189] "是" "沙烏地阿" "拉伯" "其次"
## [193] "為" "阿拉伯" "聯合" "大公國"
## [197] "而" "如果" "油價" "因為"
## [201] "任何" "供應" "問題" "再度"
## [205] "飆升" "每桶" "美元" "估計"
## [209] "將令" "全球" "經濟" "增長"
## [213] "削減" "百分點" "通膨則將" "上揚"
## [217] "百分點" "油田" "服務公司" "貝克"
## [221] "休斯" "公佈" "截至" "美國"
## [225] "石油" "天然氣" "探勘井" "數量"
## [229] "較前" "減少" "創下" "去年"
## [233] "以來" "新低" "其中" "主要"
## [237] "用於" "頁岩油" "氣" "開採"
## [241] "水平" "探勘井" "數量" "較前"
## [245] "持平" "為座" "探勘" "活動"
## [249] "增減" "會" "反映" "未來"
## [253] "石油" "產量" "貝克" "休斯"
## [257] "統計" "探勘井" "是" "指為"
## [261] "開發" "以及" "探勘" "新"
## [265] "油氣" "儲藏所" "設" "鑽井"
## [269] "鑽機" "數量" "貝克" "休斯"
## [273] "數據" "顯示" "截至" "美國"
## [277] "石油" "探勘井" "數量" "較前"
## [281] "所創" "逾" "一年" "新低"
## [285] "增加" "累計" "今年" "仍"
## [289] "減少" "天然氣" "探勘井" "數量"
## [293] "較前" "減少" "去年" "同期相比"
## [297] "美國" "石油" "探勘井" "數量"
## [301] "減少" "天然氣" "探勘井" "數量"
## [305] "減少" "水平" "探勘井" "數量"
## [309] "減座" "根據" "美國能源部" "週度"
## [313] "預測" "數據" "截至" "當週"
## [317] "美國" "原油" "日均" "產量"
## [321] "再創新高" "水平" "萬桶" "美國"
## [325] "最大" "產油" "州" "德州"
## [329] "油氣" "探勘井" "數量" "較前"
## [333] "減少" "緊鄰" "德州" "上方"
## [337] "奧克拉荷" "馬" "州" "油氣"
## [341] "探勘井" "數量" "較前" "增加"
## [345] "新墨西哥州" "油氣" "探勘井" "數量"
## [349] "較前" "增加" "路易斯安那州" "油氣"
## [353] "探勘井" "數量" "較前" "持平"
## [357] "為座" "北達科他州" "油氣" "探勘井"
## [361] "數量" "較前" "減少" "最大"
## [365] "頁岩油" "產地" "盤據" "西德"
## [369] "州" "新墨西哥州" "東南部" "二疊紀"
## [373] "盆地" "石油" "探勘井" "數量"
## [377] "較前" "減少"
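Judging from the output above, stop words are matched against whole tokens: 座 is gone, but 為座 and 減座 survive because they are different tokens that merely contain it. A minimal base R sketch of the same idea, using a small hand-written token vector rather than real jiebaR output:

```r
tokens <- c("探勘井", "數量", "持平", "為座", "的", "座", "至")
stop_words <- c("的", "座", "至")

# Whole-token comparison: "為座" is kept even though it contains
# the stop word "座".
kept <- tokens[!tokens %in% stop_words]
print(kept)   # "探勘井" "數量" "持平" "為座"
```

If fused tokens like these matter for your analysis, the fix belongs in the cleaning step before segmentation, not in the stop word list.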
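Finishing segmentation is only the real beginning. The second step is usually to count word frequencies, and with those frequencies we can pass the result straight to the wordcloud package to visualize what the article is about.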
# Count word frequencies
txt_freq <- freq(seg_words)
# Sort from most to least frequent
txt_freq <- arrange(txt_freq, desc(freq))
# Check the top entries (head() shows 6 rows by default)
head(txt_freq)
## char freq
## 1 美國 16
## 2 探勘井 14
## 3 數量 14
## 4 較前 10
## 5 減少 8
## 6 伊朗 7
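For reference, the same kind of frequency table can be built with base R alone. This is a sketch on a tiny hand-written token vector, assuming only that the result should have the same `char` and `freq` columns as the output above; it is not how jiebaR's `freq()` is implemented.

```r
tokens <- c("美國", "探勘井", "數量", "美國", "探勘井", "美國")

counts <- table(tokens)                       # occurrences per distinct token
freq_df <- data.frame(char = names(counts),
                      freq = as.integer(counts),
                      stringsAsFactors = FALSE)
freq_df <- freq_df[order(-freq_df$freq), ]    # most frequent first
rownames(freq_df) <- NULL
print(freq_df)
```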
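There are two main word cloud packages. wordcloud is the basic one and produces static images; wordcloud2, as the name suggests, is its more advanced sibling and produces interactive graphics, which makes it a good fit for web pages such as Shiny apps. In my view, though, wordcloud has the more complete set of parameters, and the two packages do not name their parameters the same way, so be careful not to mix them up.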
par(family = "Microsoft YaHei") # wordcloud needs a font that supports Chinese, otherwise the characters will not render
# Static word cloud (pkg: wordcloud)
wordcloud(txt_freq$char, txt_freq$freq, min.freq = 2, random.order = FALSE, ordered.colors = FALSE, colors = rainbow(nrow(txt_freq)))
# Interactive word cloud (pkg: wordcloud2)
wordcloud2(filter(txt_freq, freq > 1),
           minSize = 2, fontFamily = "Microsoft YaHei", size = 1)
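When drawing a word cloud, the minimum frequency threshold matters a lot. With tens of thousands of articles on hand, segmentation will produce far more than a hundred distinct words, and we do not need to draw all of them. For the static wordcloud() function, the min.freq parameter is therefore very important. The interactive wordcloud2() function does not seem to offer an equivalent setting, so we can use dplyr's filter() to select the words we need beforehand.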
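The pre-filtering step can be wrapped in a small function that applies both a frequency threshold and a cap on the number of words. This is a minimal base R sketch; `trim_vocab()` and its parameter names are hypothetical, and the demo rows reuse the six counts shown earlier.

```r
# Hypothetical helper: keep words with freq >= min_freq, sorted by frequency,
# and return at most max_words of them.
trim_vocab <- function(freq_df, min_freq = 2, max_words = 100) {
  kept <- freq_df[freq_df$freq >= min_freq, , drop = FALSE]
  kept <- kept[order(-kept$freq), , drop = FALSE]
  head(kept, max_words)
}

demo <- data.frame(char = c("美國", "探勘井", "數量", "較前", "減少", "伊朗"),
                   freq = c(16, 14, 14, 10, 8, 7),
                   stringsAsFactors = FALSE)
top_words <- trim_vocab(demo, min_freq = 10, max_words = 3)
print(top_words)
```

The trimmed data frame keeps the `char` and `freq` columns, so it can be passed on in place of the `filter()` result used above.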
The next post will cover how to apply text mining to finance.