系統參數設定
Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # 避免中文亂碼## [1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8"
安裝需要的packages
packages = c("readr", "dplyr", "jiebaR", "tidyr", "tidytext", "igraph", "topicmodels", "ggplot2", "stringr")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)讀進library
library(readr)
library(dplyr)
library(jiebaR)
library(tidyr)
library(tidytext)
library(igraph)
library(topicmodels)
library(stringr)
library(ggplot2)
library(reshape2)
library(wordcloud2)
library(lubridate)
library(jsonlite)
library(purrr)
library(LDAvis)
library(slam)
library(servr)這次我們針贈送疫苗事件,討論ptt版上相關討論的發文風向,主要針對以下方向分析:
1.日本贈送台灣疫苗的討論重點有哪些? 主要分為哪幾種風向? 2.目前風向最偏哪邊? 3.討論關於贈送疫苗社群網路如何分布? 4.贈送疫苗的意見領袖有誰?網友的推噓狀態如何?
knitr::include_graphics("./pic/1.png")Gossiping/HatePolitics/nCov2019 版文章、回覆
knitr::include_graphics("./pic/2.png")總共1462篇文章
knitr::include_graphics("./pic/3.png")拆解文章Json結構推噓文
posts = read.csv("./data/data.csv")
reviews = do.call(rbind, lapply(1:nrow(posts), function(i) {
# transfer string to dataframe
comment_ = fromJSON(posts$artComment[i])
# check number of comment
if(length(comment_) == 0){
NULL
}else{
# add artPoster form source-data
comment_$artPoster = posts$artPoster[i]
comment_$artUrl = posts$artUrl[i]
comment_$artTitle = posts$artTitle[i]
comment_$artDate =as_date(as.POSIXct(posts$artDate[i]))
comment_
}
}))在本篇分析中,我們希望建構特定議題的社群網路圖,並分析網路中討論的議題主題
我們需要兩種資料: (1) 每篇文章的主題分類(LDA) (2) 社群網路圖的link和nodes
head(posts)## system_id artUrl
## 1 1 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
## 2 2 https://www.ptt.cc/bbs/Gossiping/M.1620030314.A.C24.html
## 3 3 https://www.ptt.cc/bbs/Gossiping/M.1620360816.A.B31.html
## 4 4 https://www.ptt.cc/bbs/Gossiping/M.1620486461.A.35D.html
## 5 5 https://www.ptt.cc/bbs/Gossiping/M.1620580087.A.7FB.html
## 6 6 https://www.ptt.cc/bbs/Gossiping/M.1620608245.A.78D.html
## artTitle artDate artPoster
## 1 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01 05:22:04 moonshade
## 2 Re:[問卦]韓國人最近一直在YouTube酸台灣的掛 2021-05-03 16:25:12 zhwang2123
## 3 Re:[問卦]萬一大麻合法化了,會發生什麼事 2021-05-07 12:13:34 kurt911
## 4 [新聞]嘉南大圳開工百年安倍晉三感謝台灣珍惜 2021-05-08 23:07:39 SAgirl
## 5 [問卦]台灣人該如何勸日本人才會乖乖開始防疫? 2021-05-10 01:08:05 JUNOCARE
## 6 [問卦]有沒有中國人的原則是什麼? 2021-05-10 08:57:23 myIDis7
## artCatagory
## 1 Gossiping
## 2 Gossiping
## 3 Gossiping
## 4 Gossiping
## 5 Gossiping
## 6 Gossiping
## artContent
## 1 日本新聞講的觀點是\n\n 很多公司不願意或沒有辦法讓員工在家上班,員工\n 來上班了就會有聚會,沒辦法他們的習慣就是這樣,\n 尤其是上司要求了無法拒絕,大阪府的公務員就是\n 去聚會整個單位都中,身為公家機關沒辦法以身作\n 則已經被罵到翻掉\n\n 再來是經過一年彈性疲乏,在春季的時候因為沒有\n 緊急事態很多人都照常活動,那些宣導都是柔性\n 並沒有強制很多店家都開放店內用餐,採取的\n 一些措施但是都是做心安的,像是測量二氧化碳\n 然後開窗換氣,這還是東京都,外面的縣市大多\n 都沒在管照常一般營業,尤其是這次先掛掉的\n 大阪,一開始大阪掛掉東京鄉民還在嘲笑他們\n 過沒多久東京也跟著爆了\n\n 新型變異種對年輕人傳染力高,有好幾個小學\n 都爆出幾乎全班中,都是中新型變異種,然後\n 日本疫苗現在也只有開始打老人,線上登記問題一堆\n 連公務員都不會用,現場排隊登記大排長龍\n\n\n
## 2 看了內容 韓國人會這樣酸是好事吧\n\n裡面幾部都是把台灣和韓國做比較\n\n可能因為台灣之前疫情守住 台積電與某些IC產業又領先三星的關係\n\n人均快超車韓國 韓國才會開始有一連串報導\n\n以前都台灣人在罵韓國xxx怎樣的\n\n韓國比較沒在意 那是因為韓國沒把台灣放在眼裡\n\n現在追進開始被比較就會在意了\n\n別看現在台日友好\n\n哪天台灣能追近日本 大概也差不多\n\n\n\n
## 3 農民搶種\n每公斤跌至2萬塊\n麻農哭喊活不下去\n要求政府補助\n到總統府前倒大麻抗議\n民眾發起買大麻挺麻農運動\n中國禁止進口台灣大麻\n日本力挺台灣大麻\n台日友好\n因為大量開墾坡地種大麻\n颱風來時發生土石流\n\n大概是這樣吧\n
## 4 https://www.cna.com.tw/news/aopl/202105080232.aspx\n嘉南大圳開工百年 安倍晉三感謝台灣珍惜歷史\n最新更新:2021/05/08 22:56\n\n(中央社記者楊明珠東京8日專電)今年是嘉南大圳最主要工程之一的烏山頭水庫開工百\n年,八田與一紀念園區今天舉辦紀念活動,台日視訊轉播。日本前首相安倍晉三預錄影片\n致詞感謝台灣珍惜關於八田與一的歷史。\n\n為紀念嘉南大圳開工滿百週年,文化部、農委會、台南市政府等單位今天在台南市的八田\n與一紀念公園合辦紀念典禮,透過電視轉播,同步於台南市及八田與一的故鄉日本石川縣\n金澤市舉行活動。\n\n駐日代表謝長廷率駐日代表處官員協助主辦單位洽邀日本前首相安倍晉三、 日本台灣交\n流協會會長大橋光夫等人錄製致詞影片。\n\n安倍表示,八田與一銅像是呈現穿著工作靴的模樣,這讓他感受到大家不希望忘記八田與\n一辛勞工作的身影。他很感謝台灣珍惜關於八田與一的歷史,台灣總是帶給日本人很多美\n好的回憶,那是只有真正的朋友才能締結的深刻心靈交流。\n\n大橋表示,感謝台灣一直傳頌八田與一,日台交流協會也將持續加深前人留下、令人驕傲\n的日台友好情誼,並傳承給下一代。\n\n謝長廷也藉影片致詞表示,100年前排除萬難才有今日的百年大圳,俗語說「吃果子要拜\n樹頭、飲水要思源頭」,在八田工程師冥誕的5月8日這個日子,緬懷他和工程團隊的功績\n,也悼念在工程中傷亡的134位員工。建設這座水庫的歷史是台灣的歷史,也是日本的歷\n史,期盼這座百年大圳永遠成為台日友好的橋樑。\n\n典禮也規劃錄製日本政要的祝福影片,計有日本前首相森喜朗、日華議員懇談會會長古屋\n圭司、眾議員衛藤征士郎、台裔日籍參議員蓮舫、參議員瀧波宏文以及日本台灣之友會會\n長黑須隆一等獻上祝福。\n\n森喜朗表示,他是從前總統李登輝那裡聽到有關八田與一的事,令他十分感動,期盼2019\n冠狀病毒疾病(COVID-19)疫情早日平息,台日民眾得以重啟交流。\n\n蓮舫表示,台南是她父親的故鄉,她曾造訪八田與一紀念公園,看到銅像是八田與一穿著\n工作服、坐在地上的姿態,她感覺八田與一對台灣來說是那麼親近的存在,令她很感動。\n\n八田與一的孫子八田修一表示,看到台灣人直到現在仍珍惜使用嘉南大圳和烏山頭水庫,\n祖父八田與一若有知,必定感動落淚。(編輯:周永捷)1100508\n\n6.備註:\n台灣說慰安婦是自願的 你也很感謝是吧?\n
## 5 如題\n\n全世界現在剩兩個疫情爆發中的大國\n\n一個是印度\n\n一個就是日本了\n\n日本還有黃金周假期\n\n大家忙著外出濃密接觸\n\n導致確診速度越來越高\n\n感染率已經能與印度較勁\n\n這樣下去別說今年了\n\n明年都不一定能去日本玩\n\n為了日本的拉麵和燒肉\n\n身為台日友好的台灣這邊\n\n到底該怎麼跟日本人勸說\n\n日本人才會恍然大悟\n\n乖乖待在家裡防疫呢?\n\n有沒有相關八卦\n\n\n\n
## 6 看隔壁板討論天下布魔\n\n原本遊戲歸遊戲 政治歸政治\n\n但因為找pan代言 一堆中國人因為原則就退遊了\n\npan被舉報台獨\n因為按了台日友好的留言讚\n黑人問號.jpg\n\n中國人的原則是翻牆打色情遊戲嗎?\n\n有沒有八卦\n
## artComment
## 1 [{"cmtStatus": "推", "cmtPoster": "hipmyhop", "cmtContent": ":台灣現在就是這樣啊對變種病毒太掉以輕心", "cmtDate": "2021-05-01 05:28:00"}, {"cmtStatus": "→", "cmtPoster": "hipmyhop", "cmtContent": ":祈禱五月最好沒事否則大家準備飛高高", "cmtDate": "2021-05-01 05:29:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":現在就是彈性疲乏除非一口氣死亡率提高到50%", "cmtDate": "2021-05-01 05:32:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":甚至每天都是死幾百幾千才有可能恐慌", "cmtDate": "2021-05-01 05:32:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":看資本市場的資金流通/民眾的應變就知道了", "cmtDate": "2021-05-01 05:33:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":單純是防到彈性疲乏畢竟死亡率沒有民眾認知來的可怕", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":今天又不是說100人很容易感染60+然後感染者又死亡率", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":輕易超越50%以上所以彈性疲乏也很正常", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "iampig951753", "cmtContent": ":人為財死如果這你還不懂就是沒有慧根", "cmtDate": "2021-05-01 05:40:00"}, {"cmtStatus": "噓", "cmtPoster": "cefywo", "cmtContent": ":本來就是死亡率又不高,整天綁手綁腳的不覺得CP值很低嗎.", "cmtDate": "2021-05-01 06:01:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":..", "cmtDate": "2021-05-01 06:01:00"}, {"cmtStatus": "→", "cmtPoster": "snow3804", "cmtContent": ":沒那摸倒楣", "cmtDate": "2021-05-01 06:14:00"}, {"cmtStatus": "→", "cmtPoster": "annielu", "cmtContent": ":噓某樓,死亡率不高你去得啊!有後遺症不要來又哭爸哭媽", "cmtDate": "2021-05-01 06:19:00"}, {"cmtStatus": "噓", "cmtPoster": "clerkhsiao", "cmtContent": ":住美國對日本這麼了解?", "cmtDate": "2021-05-01 06:49:00"}, {"cmtStatus": "→", "cmtPoster": "moonshade", "cmtContent": ":我都看日本新聞練習日文", "cmtDate": "2021-05-01 06:52:00"}, {"cmtStatus": "推", "cmtPoster": "ifyoutry", "cmtContent": ":數學不錯喔還知道死亡率不高,以2%死亡率來計算假設台灣", "cmtDate": "2021-05-01 07:27:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":都一年多了怎麼還有人覺得死亡率不高所以沒關係這個病", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":的特點在於傳染力極強一旦爆發會瞬間癱瘓醫療系統到", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":時連得其他病的人也會死掉", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":人有八成的人得到那大概會死三十七萬人,大概就噴掉一個", "cmtDate": "2021-05-01 07:29:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":基隆市的人口而已.", "cmtDate": "2021-05-01 07:31:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":當然講八成是危言聳聽啦,但即使取十分之一也得死三萬多", "cmtDate": "2021-05-01 07:35:00"}, {"cmtStatus": "噓", "cmtPoster": "takasaki", "cmtContent": ":死亡率不到2%好嗎...", "cmtDate": "2021-05-01 08:41:00"}, {"cmtStatus": "噓", "cmtPoster": "cefywo", "cmtContent": ":就算2%都不高阿....何況不到2%", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":要讓我中就中阿反正基本死不了", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":整天搞東搞西到底是有完沒完,70億人2%也就1.4億人而已還", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":可以降低資源需求的壓力", "cmtDate": "2021-05-01 13:17:00"}]
## 2 [{"cmtStatus": "推", "cmtPoster": "realtw", "cmtContent": ":人均GDP臺灣是造假的人均收入呢臺灣差韓國還差得遠", "cmtDate": "2021-05-03 16:26:00"}, {"cmtStatus": "→", "cmtPoster": "realtw", "cmtContent": ":臺灣的人均GDP即使比韓國高兩倍人均收入低於韓國還是被笑", "cmtDate": "2021-05-03 16:27:00"}, {"cmtStatus": "→", "cmtPoster": "realtw", "cmtContent": ":事實上臺灣在韓國的眼中還是東南亞落後地區", "cmtDate": "2021-05-03 16:27:00"}, {"cmtStatus": "→", "cmtPoster": "tony121010", "cmtContent": ":支那人跟人談造假耶", "cmtDate": "2021-05-03 16:27:00"}, {"cmtStatus": "→", "cmtPoster": "tony121010", "cmtContent": ":是不是世界奇觀啊XD", "cmtDate": "2021-05-03 16:28:00"}, {"cmtStatus": "→", "cmtPoster": "anmico", "cmtContent": ":最愛造假的就中共都全面脫貧了XD", "cmtDate": "2021-05-03 16:28:00"}, {"cmtStatus": "→", "cmtPoster": "eric999", "cmtContent": ":能追上韓國就很厲害啦,還想追日本喔,台日的人口數跟科", "cmtDate": "2021-05-03 16:28:00"}, {"cmtStatus": "→", "cmtPoster": "eric999", "cmtContent": ":技程度根本深淵博大的差距。", "cmtDate": "2021-05-03 16:28:00"}, {"cmtStatus": "→", "cmtPoster": "anmico", "cmtContent": ":中國人是不是很怕台韓友好", "cmtDate": "2021-05-03 16:28:00"}, {"cmtStatus": "→", "cmtPoster": "soria", "cmtContent": ":反正韓國疫情已經徹徹底底炸了隨便啦", "cmtDate": "2021-05-03 16:29:00"}, {"cmtStatus": "推", "cmtPoster": "popo123456", "cmtContent": ":台灣gdp藏在企業,支拿人沒什麼常識,可憐", "cmtDate": "2021-05-03 16:29:00"}, {"cmtStatus": "→", "cmtPoster": "eric999", "cmtContent": ":哈,某些支那人對台灣就是有莫名恨意,這就是兩岸為什麼", "cmtDate": "2021-05-03 16:29:00"}, {"cmtStatus": "→", "cmtPoster": "eric999", "cmtContent": ":不能統一的主因。", "cmtDate": "2021-05-03 16:29:00"}, {"cmtStatus": "→", "cmtPoster": "Qaaaa", "cmtContent": ":台灣不也是瘋狂開酸韓國嗎?那就辦經巴嶺啊", "cmtDate": "2021-05-03 16:30:00"}, {"cmtStatus": "推", "cmtPoster": "askacis", "cmtContent": ":支那人說別人造假XDDD", "cmtDate": "2021-05-03 16:30:00"}, {"cmtStatus": "→", "cmtPoster": "tony121010", "cmtContent": ":支那就自卑轉自大,可悲的民族", "cmtDate": "2021-05-03 16:30:00"}, {"cmtStatus": "→", "cmtPoster": "eric999", "cmtContent": ":那是因為當年國民黨故意打反韓牌來帶走反中的方向。不懂", "cmtDate": "2021-05-03 16:32:00"}, {"cmtStatus": "→", "cmtPoster": "eric999", "cmtContent": ":可以去google.那個故意找個阿婆穿韓服一邊奸笑一邊翻牌", "cmtDate": "2021-05-03 16:32:00"}, {"cmtStatus": "→", "cmtPoster": "eric999", "cmtContent": ":的國民黨政治廣告真的超經典", "cmtDate": "2021-05-03 16:32:00"}, {"cmtStatus": "推", "cmtPoster": "tokuchi2013", "cmtContent": ":被韓國酸我們應該要開心啊,表示我們開始被韓國人重", "cmtDate": "2021-05-03 16:33:00"}, {"cmtStatus": "→", "cmtPoster": "tokuchi2013", "cmtContent": ":視了!", "cmtDate": "2021-05-03 16:33:00"}, {"cmtStatus": "→", "cmtPoster": "tony121010", "cmtContent": ":支那連結樓下幫點", "cmtDate": "2021-05-03 16:34:00"}, {"cmtStatus": "→", "cmtPoster": "TOKAN", "cmtContent": ":453&mid=XG08H345Q", "cmtDate": "2021-05-03 16:34:00"}, {"cmtStatus": "→", "cmtPoster": "TOKAN", "cmtContent": ":懶得縮網址,這是中國人評論台灣gdp", "cmtDate": "2021-05-03 16:34:00"}, {"cmtStatus": "推", "cmtPoster": "soria", "cmtContent": ":疫情讓一個國家停擺快兩年說真的蠻可怕的難怪他們抓狂", "cmtDate": "2021-05-03 16:35:00"}, {"cmtStatus": "→", "cmtPoster": "TOKAN", "cmtContent": ":台灣gdp成長根本不是靠三角貿易,realtw功課自己去做吧", "cmtDate": "2021-05-03 16:35:00"}, {"cmtStatus": "推", "cmtPoster": "chen730419", "cmtContent": ":去過南韓~一點也不覺得他們有多先進呀!", "cmtDate": "2021-05-03 16:36:00"}, {"cmtStatus": "→", "cmtPoster": "salkuo", "cmtContent": ":一樓或是其他支五毛的論點早就看膩了每次也都是那幾個特", "cmtDate": "2021-05-03 16:36:00"}, {"cmtStatus": "→", "cmtPoster": "salkuo", "cmtContent": ":定議題跟蟑螂藥一樣會吸一堆出來回覆一些罐頭回應真的也", "cmtDate": "2021-05-03 16:37:00"}, {"cmtStatus": "→", "cmtPoster": "salkuo", "cmtContent": ":是.....", "cmtDate": "2021-05-03 16:37:00"}, {"cmtStatus": "→", "cmtPoster": "ILike58", "cmtContent": ":他們旅遊熱點過去一年超慘的,一堆商店撤租。", "cmtDate": "2021-05-03 16:39:00"}, {"cmtStatus": "→", "cmtPoster": "TOKAN", "cmtContent": ":從貼一次好了", "cmtDate": "2021-05-03 16:39:00"}, {"cmtStatus": "→", "cmtPoster": "TOKAN", "cmtContent": ":中國人自己做的分析比realcn客觀多了真是廣西小丑", "cmtDate": "2021-05-03 16:40:00"}, {"cmtStatus": "推", "cmtPoster": "coldqoo", "cmtContent": ":看到韓國的文字,眼睛都@@,日文還有漢字可以猜,韓文全", "cmtDate": "2021-05-03 16:41:00"}, {"cmtStatus": "→", "cmtPoster": "coldqoo", "cmtContent": ":看不懂,連台灣寫在哪裡都不知道@@", "cmtDate": "2021-05-03 16:41:00"}, {"cmtStatus": "推", "cmtPoster": "fransiceyho", "cmtContent": ":韓國沒進入大企業的話收入和台灣差不多", "cmtDate": "2021-05-03 17:23:00"}, {"cmtStatus": "→", "cmtPoster": "fransiceyho", "cmtContent": ":韓國更慘的是一堆沒保障的派遣工", "cmtDate": "2021-05-03 17:24:00"}, {"cmtStatus": "推", "cmtPoster": "sid300", "cmtContent": ":韓國一直是台灣的敵人", "cmtDate": "2021-05-03 19:52:00"}, {"cmtStatus": "推", "cmtPoster": "lkdsa", "cmtContent": ":同感,覺得能被討論其實代表他們把台灣放在眼裡了,別怕被", "cmtDate": "2021-05-03 20:52:00"}, {"cmtStatus": "→", "cmtPoster": "lkdsa", "cmtContent": ":討厭,中國日本被討厭的才嚴重咧,他們都活好好的,被討厭", "cmtDate": "2021-05-03 20:52:00"}, {"cmtStatus": "→", "cmtPoster": "lkdsa", "cmtContent": ":的勇氣哈哈", "cmtDate": "2021-05-03 20:52:00"}, {"cmtStatus": "→", "cmtPoster": "lkdsa", "cmtContent": ":覺得被酸很榮幸!!過去他們可是不把台放在眼裡的", "cmtDate": "2021-05-03 20:54:00"}, {"cmtStatus": "推", "cmtPoster": "kenuser", "cmtContent": ":台灣的派遣工也不遑多讓,但是正視自己的缺點努力改進才", "cmtDate": "2021-05-03 23:29:00"}, {"cmtStatus": "→", "cmtPoster": "kenuser", "cmtContent": ":會進步", "cmtDate": "2021-05-03 23:29:00"}]
## 3 [{"cmtStatus": "→", "cmtPoster": "douge", "cmtContent": ":一堆街上被酒駕大麻仔撞死的無辜人命加上路邊都是尿騷味", "cmtDate": "2021-05-07 12:14:00"}, {"cmtStatus": "→", "cmtPoster": "douge", "cmtContent": ":大概是這樣子", "cmtDate": "2021-05-07 12:14:00"}, {"cmtStatus": "推", "cmtPoster": "phix", "cmtContent": ":變成吸麻開車撞死人", "cmtDate": "2021-05-07 12:14:00"}, {"cmtStatus": "噓", "cmtPoster": "notmine", "cmtContent": ":想太多種在平地就好了", "cmtDate": "2021-05-07 12:14:00"}, {"cmtStatus": "推", "cmtPoster": "BKcrow", "cmtContent": ":議員立委一定搶著種", "cmtDate": "2021-05-07 12:17:00"}, {"cmtStatus": "推", "cmtPoster": "jimopon", "cmtContent": ":不會接下來是要考慮如何合法出口", "cmtDate": "2021-05-07 12:18:00"}, {"cmtStatus": "推", "cmtPoster": "michaelwu", "cmtContent": ":xDD今年颱風大麻收成不好", "cmtDate": "2021-05-07 12:24:00"}, {"cmtStatus": "噓", "cmtPoster": "mhfo3035", "cmtContent": ":想太多大麻種室內都可以", "cmtDate": "2021-05-07 12:26:00"}, {"cmtStatus": "噓", "cmtPoster": "lockyola05", "cmtContent": ":光台灣藝人就呼到供不應求了還跌", "cmtDate": "2021-05-07 12:27:00"}, {"cmtStatus": "→", "cmtPoster": "douge", "cmtContent": ":衛星都照的到種哪都沒用", "cmtDate": "2021-05-07 12:28:00"}, {"cmtStatus": "推", "cmtPoster": "mk5520", "cmtContent": ":一樓智商堪慮會有尿騷味的是k他命吧", "cmtDate": "2021-05-07 14:26:00"}]
## 4 [{"cmtStatus": "推", "cmtPoster": "Padfoneman", "cmtContent": ":又八又一又三的", "cmtDate": "2021-05-08 23:09:00"}, {"cmtStatus": "推", "cmtPoster": "omfg5487", "cmtContent": ":謝謝你感謝台灣感謝殖民", "cmtDate": "2021-05-08 23:09:00"}, {"cmtStatus": "→", "cmtPoster": "nerevian", "cmtContent": ":斯德哥爾摩", "cmtDate": "2021-05-08 23:10:00"}, {"cmtStatus": "→", "cmtPoster": "botnet", "cmtContent": ":日本那時真的有心長期建設台灣", "cmtDate": "2021-05-08 23:10:00"}, {"cmtStatus": "噓", "cmtPoster": "sali921", "cmtContent": ":台灣女人本來就欠我大天皇民族幹啊", "cmtDate": "2021-05-08 23:11:00"}, {"cmtStatus": "→", "cmtPoster": "pf775", "cmtContent": ":日台親善本一家~", "cmtDate": "2021-05-08 23:13:00"}, {"cmtStatus": "推", "cmtPoster": "koxinga", "cmtContent": ":日本算是有計劃長期經營殖民地,不少建設到今天仍然有用", "cmtDate": "2021-05-08 23:16:00"}, {"cmtStatus": "推", "cmtPoster": "AWPER", "cmtContent": ":反觀藍黨", "cmtDate": "2021-05-08 23:16:00"}, {"cmtStatus": "噓", "cmtPoster": "andy199113", "cmtContent": ":爛咖落跑安倍害死日本", "cmtDate": "2021-05-08 23:19:00"}, {"cmtStatus": "噓", "cmtPoster": "nutrino", "cmtContent": ":笑死惹斯德哥爾摩症候群", "cmtDate": "2021-05-08 23:20:00"}, {"cmtStatus": "推", "cmtPoster": "Francix", "cmtContent": ":要不是國民黨那麼爛,台灣人那會懷念日本殖民", "cmtDate": "2021-05-08 23:22:00"}, {"cmtStatus": "噓", "cmtPoster": "poggssi", "cmtContent": ":滾啦", "cmtDate": "2021-05-08 23:22:00"}, {"cmtStatus": "推", "cmtPoster": "allan0926", "cmtContent": ":國民黨從來都沒有好好規劃建設,只會偷拐騙", "cmtDate": "2021-05-08 23:23:00"}, {"cmtStatus": "噓", "cmtPoster": "scoop", "cmtContent": ":皇奴", "cmtDate": "2021-05-08 23:24:00"}, {"cmtStatus": "→", "cmtPoster": "NMNS20th", "cmtContent": ":國民黨確實沒有資進黨會規劃,看看把持輿論的資進黨,", "cmtDate": "2021-05-08 23:25:00"}, {"cmtStatus": "→", "cmtPoster": "NMNS20th", "cmtContent": ":有誰比他會規劃?", "cmtDate": "2021-05-08 23:25:00"}, {"cmtStatus": "噓", "cmtPoster": "andy199113", "cmtContent": ":台積電十大建設=>國民黨執政", "cmtDate": "2021-05-08 23:26:00"}, {"cmtStatus": "推", "cmtPoster": "lkdsa", "cmtContent": ":日本那時是真以為台灣會是永久領土才砸那麼多錢,這建設要", "cmtDate": "2021-05-08 23:29:00"}, {"cmtStatus": "→", "cmtPoster": "lkdsa", "cmtContent": ":問農民意見,當地農民會感念八田的,雕像還被藏起來保護過", "cmtDate": "2021-05-08 23:29:00"}, {"cmtStatus": "→", "cmtPoster": "Brown1010", "cmtContent": ":謝謝日本強拉台灣人當礦工軍伕", "cmtDate": "2021-05-08 23:29:00"}, {"cmtStatus": "→", "cmtPoster": "lkdsa", "cmtContent": ":雖然殖民,但日本這點不得不說厲害,這種建設能用那麼長", "cmtDate": "2021-05-08 23:31:00"}, {"cmtStatus": "→", "cmtPoster": "lkdsa", "cmtContent": ":久多年", "cmtDate": "2021-05-08 23:31:00"}, {"cmtStatus": "推", "cmtPoster": "taso5566", "cmtContent": ":真的要感謝日本,讓台灣人保有殖民地的奴性", "cmtDate": "2021-05-08 23:32:00"}, {"cmtStatus": "→", "cmtPoster": "lkdsa", "cmtContent": ":kmt戰時到處拉中國男丁去打仗不是...雖然那是中國人", "cmtDate": "2021-05-08 23:32:00"}, {"cmtStatus": "→", "cmtPoster": "monkeydog119", "cmtContent": ":苔派嚼菸認為慰安婦是自願的QQ", "cmtDate": "2021-05-08 23:43:00"}, {"cmtStatus": "推", "cmtPoster": "cc161", "cmtContent": ":日本人只有做表面而已啦,還不是很排外", "cmtDate": "2021-05-08 23:44:00"}, {"cmtStatus": "推", "cmtPoster": "water91", "cmtContent": ":謝謝kmt及中國人夠爛,讓我們一直感念日本", "cmtDate": "2021-05-08 23:45:00"}, {"cmtStatus": "→", "cmtPoster": "monkeydog119", "cmtContent": ":台灣抗日先烈現在應該會被苔派嚼菸罵吧QQ", "cmtDate": "2021-05-08 23:46:00"}, {"cmtStatus": "推", "cmtPoster": "aja1008", "cmtContent": ":八卦四趴五毛白蠊教柯憨糞崩潰啦!", "cmtDate": "2021-05-09 00:00:00"}, {"cmtStatus": "噓", "cmtPoster": "kbten", "cmtContent": ":很厲害哦,呵呵", "cmtDate": "2021-05-09 00:16:00"}, {"cmtStatus": "→", "cmtPoster": "kbten", "cmtContent": ":以為日本人看得起台灣人喔,呵呵", "cmtDate": "2021-05-09 00:17:00"}, {"cmtStatus": "推", "cmtPoster": "houting", "cmtContent": ":ㄎ韓五毛們很會借題發揮ㄎㄎ笑翻", "cmtDate": "2021-05-09 00:31:00"}, {"cmtStatus": "推", "cmtPoster": "ioupoiu", "cmtContent": ":台灣才要感謝日本吧", "cmtDate": "2021-05-09 00:46:00"}, {"cmtStatus": "推", "cmtPoster": "orze04", "cmtContent": ":南部人珍惜使用??", "cmtDate": "2021-05-09 00:49:00"}, {"cmtStatus": "推", "cmtPoster": "fransiceyho", "cmtContent": ":台日友好!", "cmtDate": "2021-05-09 01:13:00"}, {"cmtStatus": "噓", "cmtPoster": "gusser", "cmtContent": ":日本人殺了幾十萬台灣人然後被當神拜可悲", "cmtDate": "2021-05-09 09:28:00"}, {"cmtStatus": "推", "cmtPoster": "gliga", "cmtContent": ":支那賤畜都不遮掩啦", "cmtDate": "2021-05-09 12:12:00"}, {"cmtStatus": "噓", "cmtPoster": "darkholy", "cmtContent": ":曾文水庫,南化水庫吱吱還不是用爽爽,誰蓋的完全不感", "cmtDate": "2021-05-09 12:24:00"}, {"cmtStatus": "→", "cmtPoster": "darkholy", "cmtContent": ":謝", "cmtDate": "2021-05-09 12:24:00"}, {"cmtStatus": "噓", "cmtPoster": "hohoman", "cmtContent": ":建設台灣只是作為長期侵略東南亞的跳板,不然怎會有神靖", "cmtDate": "2021-05-09 12:36:00"}, {"cmtStatus": "→", "cmtPoster": "hohoman", "cmtContent": ":丸號事件,折損這麼多台灣醫界菁英。", "cmtDate": "2021-05-09 12:36:00"}]
## 5 [{"cmtStatus": "→", "cmtPoster": "a27588679", "cmtContent": ":先給他們時間開會", "cmtDate": "2021-05-10 01:09:00"}, {"cmtStatus": "噓", "cmtPoster": "sonicyang", "cmtContent": ":誰理你", "cmtDate": "2021-05-10 01:11:00"}, {"cmtStatus": "噓", "cmtPoster": "kyozwhie", "cmtContent": ":先叫他們不要舔人民幣阿", "cmtDate": "2021-05-10 01:11:00"}, {"cmtStatus": "→", "cmtPoster": "WunoW", "cmtContent": ":日本人本來就對政令無感了他們大多是只聽公司命令的社畜", "cmtDate": "2021-05-10 01:12:00"}, {"cmtStatus": "→", "cmtPoster": "WunoW", "cmtContent": ":其實就政府逼企業督導員工防疫就有效了", "cmtDate": "2021-05-10 01:13:00"}, {"cmtStatus": "→", "cmtPoster": "WunoW", "cmtContent": ":不要說政府日本年輕人很多對皇室不屑在公司倒是跟狗一樣", "cmtDate": "2021-05-10 01:15:00"}, {"cmtStatus": "推", "cmtPoster": "OPPAISuki", "cmtContent": ":幹嘛勸?清除一些老人不好嗎?", "cmtDate": "2021-05-10 01:19:00"}, {"cmtStatus": "推", "cmtPoster": "jerrylin", "cmtContent": ":死的人夠多大概就會了", "cmtDate": "2021-05-10 01:20:00"}, {"cmtStatus": "→", "cmtPoster": "jerrylin", "cmtContent": ":如果連死都不怕你幫他們窮緊張幹嘛?", "cmtDate": "2021-05-10 01:21:00"}, {"cmtStatus": "→", "cmtPoster": "jerrylin", "cmtContent": ":反正只有死掉跟痊癒後有抗體兩種選項", "cmtDate": "2021-05-10 01:21:00"}, {"cmtStatus": "→", "cmtPoster": "j73596", "cmtContent": ":日本人要怎麼勸臺灣人才會尊守交通規則?", "cmtDate": "2021-05-10 01:34:00"}, {"cmtStatus": "推", "cmtPoster": "movieghost", "cmtContent": ":不用勸啊整個東奧虧的錢當繳學費囉", "cmtDate": "2021-05-10 01:40:00"}, {"cmtStatus": "推", "cmtPoster": "wyverns", "cmtContent": ":每次聽到濃密接觸都會看到石原里美畫白板的樣子", "cmtDate": "2021-05-10 01:41:00"}, {"cmtStatus": "噓", "cmtPoster": "jiunliege", "cmtContent": ":跟印度比差遠了", "cmtDate": "2021-05-10 01:42:00"}, {"cmtStatus": "推", "cmtPoster": "vking223", "cmtContent": ":乖乖在家濃密", "cmtDate": "2021-05-10 05:42:00"}, {"cmtStatus": "→", "cmtPoster": "GABA", "cmtContent": ":禁電車", "cmtDate": "2021-05-10 10:11:00"}]
## 6 [{"cmtStatus": "推", "cmtPoster": "NewCop", "cmtContent": ":中國人唯一的原則就是沒有原則", "cmtDate": "2021-05-10 08:58:00"}, {"cmtStatus": "→", "cmtPoster": "wolver", "cmtContent": ":奶子!!!", "cmtDate": "2021-05-10 08:58:00"}, {"cmtStatus": "推", "cmtPoster": "belucky", "cmtContent": ":玻璃心不要碎掉為原則", "cmtDate": "2021-05-10 08:58:00"}, {"cmtStatus": "→", "cmtPoster": "Dia149", "cmtContent": ":留島不留人", "cmtDate": "2021-05-10 08:59:00"}, {"cmtStatus": "推", "cmtPoster": "c1951", "cmtContent": ":台灣是中國不可分割的一部分阿", "cmtDate": "2021-05-10 09:00:00"}, {"cmtStatus": "推", "cmtPoster": "deolinwind", "cmtContent": ":貪財怕死好面子,再加個狼性", "cmtDate": "2021-05-10 09:00:00"}, {"cmtStatus": "推", "cmtPoster": "uanniy", "cmtContent": ":愛國", "cmtDate": "2021-05-10 09:00:00"}, {"cmtStatus": "推", "cmtPoster": "c1951", "cmtContent": ":天大地大共產黨最大爹親娘親不如蔡總統最親", "cmtDate": "2021-05-10 09:05:00"}, {"cmtStatus": "推", "cmtPoster": "melike671", "cmtContent": ":就玩個遊戲也怕被舉報啊,中國原則就是舉報", "cmtDate": "2021-05-10 09:20:00"}, {"cmtStatus": "→", "cmtPoster": "yycbr", "cmtContent": ":一個省怎麼能跟國家友好?不對等", "cmtDate": "2021-05-10 09:22:00"}, {"cmtStatus": "推", "cmtPoster": "wewants", "cmtContent": ":原則跟9.24%仔一樣高的拜低的踩欺善怕惡", "cmtDate": "2021-05-10 10:06:00"}, {"cmtStatus": "推", "cmtPoster": "CORSA", "cmtContent": ":若要擺平北京得先搞定普京", "cmtDate": "2021-05-10 13:13:00"}]
## e_ip insertedDate dataSource
## 1 24.4.155.116 2021-05-02 00:57:46 ptt
## 2 220.132.75.21 2021-05-04 01:07:20 ptt
## 3 49.216.229.0 2021-05-08 00:25:54 ptt
## 4 1.162.215.248 2021-05-09 00:31:03 ptt
## 5 36.231.147.179 2021-05-11 00:22:11 ptt
## 6 1.200.139.142 2021-05-11 00:23:11 ptt
head(reviews)## cmtStatus cmtPoster cmtContent
## 1 推 hipmyhop :台灣現在就是這樣啊對變種病毒太掉以輕心
## 2 → hipmyhop :祈禱五月最好沒事否則大家準備飛高高
## 3 → Despairile :現在就是彈性疲乏除非一口氣死亡率提高到50%
## 4 → Despairile :甚至每天都是死幾百幾千才有可能恐慌
## 5 → Despairile :看資本市場的資金流通/民眾的應變就知道了
## 6 → Despairile :單純是防到彈性疲乏畢竟死亡率沒有民眾認知來的可怕
## cmtDate artPoster
## 1 2021-05-01 05:28:00 moonshade
## 2 2021-05-01 05:29:00 moonshade
## 3 2021-05-01 05:32:00 moonshade
## 4 2021-05-01 05:32:00 moonshade
## 5 2021-05-01 05:33:00 moonshade
## 6 2021-05-01 05:34:00 moonshade
## artUrl
## 1 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
## 2 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
## 3 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
## 4 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
## 5 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
## 6 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
## artTitle artDate
## 1 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01
## 2 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01
## 3 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01
## 4 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01
## 5 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01
## 6 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01
posts %>%
mutate(artDate = as.Date(artDate)) %>%
group_by(artDate) %>%
summarise(count = n())%>%
ggplot(aes(artDate,count))+
geom_line(color="red")+
geom_point() # #文章斷句("\n\n"取代成"。")
# mask_meta <- posts %>%
# mutate(artContent=gsub("[\n]{2,}", "。", artContent))
# #以全形或半形 驚歎號、問號、分號 以及 全形句號 爲依據進行斷句
# mask_sentences <- strsplit(mask_meta$artContent,"[。!;?!?;]")
# # 將每句句子,與他所屬的文章連結配對起來,整理成一個dataframe
# mask_sentences <- data.frame(
# artUrl = rep(mask_meta$artUrl, sapply(mask_sentences, length)),
# artContent = unlist(mask_sentences)
# ) %>%
# filter(!str_detect(artContent, regex("^(\t|\n| )*$")))
# #如果有\t或\n就去掉
# mask_sentences$artContent <- as.character(mask_sentences$artContent)# ## 文章斷詞
# # load mask_lexicon(特定要斷開的詞,像是user_dict)
# mask_lexicon <- scan(file = "./dict/mask_lexicon_19.txt", what=character(),sep='\n',
# encoding='utf-8',fileEncoding='utf-8')
# # load stop words
# stop_words <- scan(file = "./dict/stop_words.txt", what=character(),sep='\n',
# encoding='utf-8',fileEncoding='utf-8')
#
# # 使用默認參數初始化一個斷詞引擎
# jieba_tokenizer = worker()
#
# # 使用口罩字典重新斷詞
# #new_user_word(jieba_tokenizer, c(mask_lexicon_19))
#
# # tokenize function
# chi_tokenizer <- function(t) {
# lapply(t, function(x) {
# if(nchar(x)>1){
# tokens <- segment(x, jieba_tokenizer)
# tokens <- tokens[!tokens %in% stop_words]
# # 去掉字串長度爲1的詞彙
# tokens <- tokens[nchar(tokens)>1]
# return(tokens)
# }
# })
# }
#
#
# # 用剛剛初始化的斷詞器把sentence斷開
# tokens <- mask_sentences %>%
# mutate(artContent = gsub("[[:punct:]]", "",artContent)) %>%
# mutate(artContent = gsub("[0-9a-zA-Z]", "",artContent)) %>%
# unnest_tokens(word, artContent, token=chi_tokenizer) %>%
# count(artUrl, word) %>% # 計算每篇文章出現的字頻
# rename(count=n)
# tokens
# save.image(file = "./data/token_result.rdata") load("./data/token_result.rdata")清理斷詞結果
。根據詞頻,選擇只出現3字以上的字 。整理成url,word,n的格式之後,就可以轉dtm
P.S. groupby by之後原本的字詞結構會不見,把詞頻另存在一個reserved_word裡面
freq = 3
# 依據字頻挑字
reserved_word <- tokens %>%
group_by(word) %>%
count() %>%
filter(n > freq) %>%
unlist()
mask_removed <- tokens %>%
filter(word %in% reserved_word)
#mask_dtm 裡面 nrow:幾篇文章 ; ncol:幾個字
mask_dtm <- mask_removed %>% cast_dtm(artUrl, word, count) mask_lda <- LDA(mask_dtm, k = 2, control = list(seed = 123))
# lda <- LDA(dtm, k = 2, control = list(seed = 2021,alpha = 2,delta=0.1),method = "Gibbs") #調整alpha即delta
mask_lda## A LDA_VEM topic model with 2 topics.
topics_words <- tidy(mask_lda, matrix = "beta") # 注意,在tidy function裡面要使用"beta"來取出Phi矩陣。
colnames(topics_words) <- c("topic", "term", "phi")
topics_words## # A tibble: 10,258 x 3
## topic term phi
## <int> <chr> <dbl>
## 1 1 辦法 1.34e- 4
## 2 2 辦法 1.84e- 3
## 3 1 爆出 1.20e- 7
## 4 2 爆出 9.85e- 5
## 5 1 變異 4.01e-14
## 6 2 變異 1.27e- 4
## 7 1 採取 2.32e- 4
## 8 2 採取 9.37e- 5
## 9 1 傳染 7.47e- 7
## 10 2 傳染 3.51e- 4
## # … with 10,248 more rows
library(showtext) #安装此包前需要在mac中按照XQuartz,link在下文## Loading required package: sysfonts
## Loading required package: showtextdb
showtext_auto()
font_add("PingFangSC-Regular",regular = "/System/Library/Fonts/PingFang.ttc") #第一个参数是根据字体随便取个名字,regular参数是相应字体在电脑中的文件terms依照各主題的phi值由大到小排序,列出前10大
topics_words %>%
group_by(topic) %>%
top_n(10, phi) %>%
ungroup() %>%
mutate(top_words = reorder_within(term,phi,topic)) %>%
ggplot(aes(x = top_words, y = phi, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()嘗試2、4、6、10、15個主題數,將結果存起來,再做進一步分析。 此部分需要跑一段時間,已經將跑完的檔案存成ldas_result_19.rdata,可以直接載入
# ldas = c()
# topics = c(2,4,6,10,15)
# for(topic in topics){
# start_time <- Sys.time()
# mask_lda <- LDA(mask_dtm, k = topic, control = list(seed = 123))
# ldas =c(ldas,mask_lda)
# print(paste(topic ,paste("topic(s) and use time is ", Sys.time() -start_time)))
# save(ldas,file = "./data/ldas_result_19.rdata") # 將模型輸出成檔案
# }載入每個主題的LDA結果
load("data/ldas_result_19.rdata")topics = c(2,4,6,10,15)
data_frame(k = topics, perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
ggplot(aes(k, perplex)) +
geom_point() +
geom_line() +
labs(title = "Evaluating LDA topic models",
subtitle = "Optimal number of topics (smaller is better)",
x = "Number of topics",
y = "Perplexity")## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
# LDA分成4個主題
mask_lda <- LDA(mask_dtm, k = 4, control = list(seed = 123))p.s. 。tidy(mask_lda, matrix = “beta”) # 取字 topic term beta值 。tidy(mask_lda, matrix=“gamma”) # 取主題 document topic gamma
removed_word = c("有沒有","好像","八卦","比較","一堆","覺得","看到","疫苗")
# 看各群的常用詞彙
tidy(mask_lda, matrix = "beta") %>% # 取出topic term beta值
filter(! term %in% removed_word) %>%
group_by(topic) %>%
top_n(10, beta) %>% # beta值前10的字
ungroup() %>%
mutate(topic = as.factor(topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()create LDAvis所需的json function 此function是將前面使用 “LDA function”所建立的model,轉換為“LDAVis”套件的input格式。
topicmodels_json_ldavis <- function(fitted, doc_term){
require(LDAvis)
require(slam)
###以下function 用來解決,主題數多會出現NA的問題
### 參考 https://github.com/cpsievert/LDAvis/commit/c7234d71168b1e946a361bc00593bc5c4bf8e57e
ls_LDA = function (phi){
jensenShannon <- function(x, y) {
m <- 0.5 * (x + y)
lhs <- ifelse(x == 0, 0, x * (log(x) - log(m+1e-16)))
rhs <- ifelse(y == 0, 0, y * (log(y) - log(m+1e-16)))
0.5 * sum(lhs) + 0.5 * sum(rhs)
}
dist.mat <- proxy::dist(x = phi, method = jensenShannon)
pca.fit <- stats::cmdscale(dist.mat, k = 2)
data.frame(x = pca.fit[, 1], y = pca.fit[, 2])
}
# Find required quantities
phi <- as.matrix(posterior(fitted)$terms)
theta <- as.matrix(posterior(fitted)$topics)
vocab <- colnames(phi)
term_freq <- slam::col_sums(doc_term)
# Convert to json
json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
vocab = vocab,
doc.length = as.vector(table(doc_term$i)),
term.frequency = term_freq, mds.method = ls_LDA)
return(json_lda)
} # the_lda = ldas[[2]]
# json_res <- topicmodels_json_ldavis(the_lda,mask_dtm)
# serVis(json_res,open.browser = T)
可以歸納出
topic 1 = “日本疫苗提供給台灣的新聞”
topic 2 = “民進黨與疫苗的相關報導”
topic 3 = “民間團體捐贈的相關報導”
topic 4 = “美國可能捐贈的相關報導”
以下我們挑出第一個主題與第三個主題來做比較。
每篇文章拿gamma值最大的topic當該文章的topic
# 在tidy function中使用參數"gamma"來取得 theta矩陣
mask_topics <- tidy(mask_lda, matrix="gamma") %>% # document topic gamma
group_by(document) %>%
top_n(1, wt=gamma)
mask_topics## # A tibble: 1,461 x 3
## # Groups: document [1,461]
## document topic gamma
## <chr> <int> <dbl>
## 1 https://www.ptt.cc/bbs/Gossiping/M.1620360816.A.B31.html 1 0.723
## 2 https://www.ptt.cc/bbs/Gossiping/M.1620486461.A.35D.html 1 0.836
## 3 https://www.ptt.cc/bbs/Gossiping/M.1621401240.A.6F3.html 1 0.555
## 4 https://www.ptt.cc/bbs/Gossiping/M.1621524294.A.C63.html 1 0.616
## 5 https://www.ptt.cc/bbs/Gossiping/M.1621601044.A.7DA.html 1 0.486
## 6 https://www.ptt.cc/bbs/Gossiping/M.1621680649.A.4D0.html 1 0.630
## 7 https://www.ptt.cc/bbs/Gossiping/M.1621783706.A.7BC.html 1 0.658
## 8 https://www.ptt.cc/bbs/Gossiping/M.1621903751.A.345.html 1 0.540
## 9 https://www.ptt.cc/bbs/Gossiping/M.1622015313.A.7E0.html 1 0.376
## 10 https://www.ptt.cc/bbs/Gossiping/M.1622015342.A.E3D.html 1 0.533
## # … with 1,451 more rows
posts_topic <- merge(x = posts, y = mask_topics, by.x = "artUrl", by.y="document")
# 看一下各主題在說甚麼
set.seed(123)
posts_topic %>% # 主題一
filter(topic==1) %>%
select(artTitle) %>%
unique() %>%
sample_n(5)## artTitle
## 1 Re:[黑特]楊寶楨出來幫國台辦說話耶
## 2 [討論]黃暐瀚:日本沒人要打AZ,那何不給台灣?
## 3 Re:[問卦]如果疫苗換核食大家同意嗎?
## 4 [新聞]感謝日本捐贈疫苗台北101點燈致意
## 5 [討論]從日本給疫苗看日本在太平洋局勢
posts_topic %>% # 主題三
filter(topic==3) %>%
select(artTitle) %>%
unique() %>%
sample_n(5)## artTitle
## 1 [問卦]這樣不是快多了?
## 2 Re:[黑特]民進黨這波疫苗策略蠢到炸裂
## 3 [討論]莊人祥說實話:我們沒EUA這會變成藥廠理由
## 4 [新聞]郭台銘想買1千萬劑疫苗捐政府子弟兵立委
## 5 [新聞]楊志良怒了!「蔡英文領導防疫台灣會亡國」 籲電話灌爆總統
這次我們把討論焦點放在日本贈送疫苗上,從主題分布大概可以看到兩類觀點:
主題一: > 對於日本政府願意贈送疫苗給台灣的新聞,如「正妹議員感謝日本贈疫苗」、「網傳日供台疫苗內幕」、「124萬劑AZ疫苗日航專機起飛赴台」、「日本政府無償提供的疫苗 」
主題三: > 大部分是對於台灣民間團體捐贈疫苗或採購的討論,對於佛光山或郭台銘等相關討論,如「佛光山想捐疫苗」、「郭董被騙還是被玩?」、「民間自購疫苗的議題」
畫出每天topic的分布,發現隨著時間增加,八卦版及政黑版主題一、三的比例逐漸增加, 且主題三的比例大於主題一。
posts_topic %>%
mutate(artDate = as.Date(artDate)) %>%
group_by(artDate,topic) %>%
summarise(sum =sum(topic)) %>%
ggplot(aes(x= artDate,y=sum,fill=as.factor(topic))) +
geom_col(position="fill") ## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
posts_topic %>%
group_by(artCatagory,topic) %>%
summarise(sum = n()) %>%
ggplot(aes(x= artCatagory,y=sum,fill=as.factor(topic))) +
geom_col(position="dodge") ## `summarise()` has grouped output by 'artCatagory'. You can override using the `.groups` argument.
# 文章和留言
reviews <- reviews %>%
select(artUrl, cmtPoster, cmtStatus, cmtContent)
posts_Reviews <- merge(x = posts, y = reviews, by = "artUrl")
# 把文章和topic
posts_Reviews <- merge(x = posts_Reviews, y = mask_topics, by.x = "artUrl", by.y="document")
head(posts_Reviews,3)## artUrl system_id
## 1 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html 1
## 2 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html 1
## 3 https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html 1
## artTitle artDate artPoster
## 1 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01 05:22:04 moonshade
## 2 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01 05:22:04 moonshade
## 3 Re:[問卦]鄉民常吹捧日本人嚴謹那他們疫情怎爆掉 2021-05-01 05:22:04 moonshade
## artCatagory
## 1 Gossiping
## 2 Gossiping
## 3 Gossiping
## artContent
## 1 日本新聞講的觀點是\n\n 很多公司不願意或沒有辦法讓員工在家上班,員工\n 來上班了就會有聚會,沒辦法他們的習慣就是這樣,\n 尤其是上司要求了無法拒絕,大阪府的公務員就是\n 去聚會整個單位都中,身為公家機關沒辦法以身作\n 則已經被罵到翻掉\n\n 再來是經過一年彈性疲乏,在春季的時候因為沒有\n 緊急事態很多人都照常活動,那些宣導都是柔性\n 並沒有強制很多店家都開放店內用餐,採取的\n 一些措施但是都是做心安的,像是測量二氧化碳\n 然後開窗換氣,這還是東京都,外面的縣市大多\n 都沒在管照常一般營業,尤其是這次先掛掉的\n 大阪,一開始大阪掛掉東京鄉民還在嘲笑他們\n 過沒多久東京也跟著爆了\n\n 新型變異種對年輕人傳染力高,有好幾個小學\n 都爆出幾乎全班中,都是中新型變異種,然後\n 日本疫苗現在也只有開始打老人,線上登記問題一堆\n 連公務員都不會用,現場排隊登記大排長龍\n\n\n
## 2 日本新聞講的觀點是\n\n 很多公司不願意或沒有辦法讓員工在家上班,員工\n 來上班了就會有聚會,沒辦法他們的習慣就是這樣,\n 尤其是上司要求了無法拒絕,大阪府的公務員就是\n 去聚會整個單位都中,身為公家機關沒辦法以身作\n 則已經被罵到翻掉\n\n 再來是經過一年彈性疲乏,在春季的時候因為沒有\n 緊急事態很多人都照常活動,那些宣導都是柔性\n 並沒有強制很多店家都開放店內用餐,採取的\n 一些措施但是都是做心安的,像是測量二氧化碳\n 然後開窗換氣,這還是東京都,外面的縣市大多\n 都沒在管照常一般營業,尤其是這次先掛掉的\n 大阪,一開始大阪掛掉東京鄉民還在嘲笑他們\n 過沒多久東京也跟著爆了\n\n 新型變異種對年輕人傳染力高,有好幾個小學\n 都爆出幾乎全班中,都是中新型變異種,然後\n 日本疫苗現在也只有開始打老人,線上登記問題一堆\n 連公務員都不會用,現場排隊登記大排長龍\n\n\n
## 3 日本新聞講的觀點是\n\n 很多公司不願意或沒有辦法讓員工在家上班,員工\n 來上班了就會有聚會,沒辦法他們的習慣就是這樣,\n 尤其是上司要求了無法拒絕,大阪府的公務員就是\n 去聚會整個單位都中,身為公家機關沒辦法以身作\n 則已經被罵到翻掉\n\n 再來是經過一年彈性疲乏,在春季的時候因為沒有\n 緊急事態很多人都照常活動,那些宣導都是柔性\n 並沒有強制很多店家都開放店內用餐,採取的\n 一些措施但是都是做心安的,像是測量二氧化碳\n 然後開窗換氣,這還是東京都,外面的縣市大多\n 都沒在管照常一般營業,尤其是這次先掛掉的\n 大阪,一開始大阪掛掉東京鄉民還在嘲笑他們\n 過沒多久東京也跟著爆了\n\n 新型變異種對年輕人傳染力高,有好幾個小學\n 都爆出幾乎全班中,都是中新型變異種,然後\n 日本疫苗現在也只有開始打老人,線上登記問題一堆\n 連公務員都不會用,現場排隊登記大排長龍\n\n\n
## artComment
## 1 [{"cmtStatus": "推", "cmtPoster": "hipmyhop", "cmtContent": ":台灣現在就是這樣啊對變種病毒太掉以輕心", "cmtDate": "2021-05-01 05:28:00"}, {"cmtStatus": "→", "cmtPoster": "hipmyhop", "cmtContent": ":祈禱五月最好沒事否則大家準備飛高高", "cmtDate": "2021-05-01 05:29:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":現在就是彈性疲乏除非一口氣死亡率提高到50%", "cmtDate": "2021-05-01 05:32:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":甚至每天都是死幾百幾千才有可能恐慌", "cmtDate": "2021-05-01 05:32:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":看資本市場的資金流通/民眾的應變就知道了", "cmtDate": "2021-05-01 05:33:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":單純是防到彈性疲乏畢竟死亡率沒有民眾認知來的可怕", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":今天又不是說100人很容易感染60+然後感染者又死亡率", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":輕易超越50%以上所以彈性疲乏也很正常", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "iampig951753", "cmtContent": ":人為財死如果這你還不懂就是沒有慧根", "cmtDate": "2021-05-01 05:40:00"}, {"cmtStatus": "噓", "cmtPoster": "cefywo", "cmtContent": ":本來就是死亡率又不高,整天綁手綁腳的不覺得CP值很低嗎.", "cmtDate": "2021-05-01 06:01:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":..", "cmtDate": "2021-05-01 06:01:00"}, {"cmtStatus": "→", "cmtPoster": "snow3804", "cmtContent": ":沒那摸倒楣", "cmtDate": "2021-05-01 06:14:00"}, {"cmtStatus": "→", "cmtPoster": "annielu", "cmtContent": ":噓某樓,死亡率不高你去得啊!有後遺症不要來又哭爸哭媽", "cmtDate": "2021-05-01 06:19:00"}, {"cmtStatus": "噓", "cmtPoster": "clerkhsiao", "cmtContent": ":住美國對日本這麼了解?", "cmtDate": "2021-05-01 06:49:00"}, {"cmtStatus": "→", "cmtPoster": "moonshade", "cmtContent": ":我都看日本新聞練習日文", "cmtDate": "2021-05-01 06:52:00"}, {"cmtStatus": "推", "cmtPoster": "ifyoutry", "cmtContent": ":數學不錯喔還知道死亡率不高,以2%死亡率來計算假設台灣", "cmtDate": "2021-05-01 07:27:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":都一年多了怎麼還有人覺得死亡率不高所以沒關係這個病", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":的特點在於傳染力極強一旦爆發會瞬間癱瘓醫療系統到", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":時連得其他病的人也會死掉", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":人有八成的人得到那大概會死三十七萬人,大概就噴掉一個", "cmtDate": "2021-05-01 07:29:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":基隆市的人口而已.", "cmtDate": "2021-05-01 07:31:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":當然講八成是危言聳聽啦,但即使取十分之一也得死三萬多", "cmtDate": "2021-05-01 07:35:00"}, {"cmtStatus": "噓", "cmtPoster": "takasaki", "cmtContent": ":死亡率不到2%好嗎...", "cmtDate": "2021-05-01 08:41:00"}, {"cmtStatus": "噓", "cmtPoster": "cefywo", "cmtContent": ":就算2%都不高阿....何況不到2%", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":要讓我中就中阿反正基本死不了", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":整天搞東搞西到底是有完沒完,70億人2%也就1.4億人而已還", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":可以降低資源需求的壓力", "cmtDate": "2021-05-01 13:17:00"}]
## 2 [{"cmtStatus": "推", "cmtPoster": "hipmyhop", "cmtContent": ":台灣現在就是這樣啊對變種病毒太掉以輕心", "cmtDate": "2021-05-01 05:28:00"}, {"cmtStatus": "→", "cmtPoster": "hipmyhop", "cmtContent": ":祈禱五月最好沒事否則大家準備飛高高", "cmtDate": "2021-05-01 05:29:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":現在就是彈性疲乏除非一口氣死亡率提高到50%", "cmtDate": "2021-05-01 05:32:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":甚至每天都是死幾百幾千才有可能恐慌", "cmtDate": "2021-05-01 05:32:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":看資本市場的資金流通/民眾的應變就知道了", "cmtDate": "2021-05-01 05:33:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":單純是防到彈性疲乏畢竟死亡率沒有民眾認知來的可怕", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":今天又不是說100人很容易感染60+然後感染者又死亡率", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":輕易超越50%以上所以彈性疲乏也很正常", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "iampig951753", "cmtContent": ":人為財死如果這你還不懂就是沒有慧根", "cmtDate": "2021-05-01 05:40:00"}, {"cmtStatus": "噓", "cmtPoster": "cefywo", "cmtContent": ":本來就是死亡率又不高,整天綁手綁腳的不覺得CP值很低嗎.", "cmtDate": "2021-05-01 06:01:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":..", "cmtDate": "2021-05-01 06:01:00"}, {"cmtStatus": "→", "cmtPoster": "snow3804", "cmtContent": ":沒那摸倒楣", "cmtDate": "2021-05-01 06:14:00"}, {"cmtStatus": "→", "cmtPoster": "annielu", "cmtContent": ":噓某樓,死亡率不高你去得啊!有後遺症不要來又哭爸哭媽", "cmtDate": "2021-05-01 06:19:00"}, {"cmtStatus": "噓", "cmtPoster": "clerkhsiao", "cmtContent": ":住美國對日本這麼了解?", "cmtDate": "2021-05-01 06:49:00"}, {"cmtStatus": "→", "cmtPoster": "moonshade", "cmtContent": ":我都看日本新聞練習日文", "cmtDate": "2021-05-01 06:52:00"}, {"cmtStatus": "推", "cmtPoster": "ifyoutry", "cmtContent": ":數學不錯喔還知道死亡率不高,以2%死亡率來計算假設台灣", "cmtDate": "2021-05-01 07:27:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":都一年多了怎麼還有人覺得死亡率不高所以沒關係這個病", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":的特點在於傳染力極強一旦爆發會瞬間癱瘓醫療系統到", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":時連得其他病的人也會死掉", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":人有八成的人得到那大概會死三十七萬人,大概就噴掉一個", "cmtDate": "2021-05-01 07:29:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":基隆市的人口而已.", "cmtDate": "2021-05-01 07:31:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":當然講八成是危言聳聽啦,但即使取十分之一也得死三萬多", "cmtDate": "2021-05-01 07:35:00"}, {"cmtStatus": "噓", "cmtPoster": "takasaki", "cmtContent": ":死亡率不到2%好嗎...", "cmtDate": "2021-05-01 08:41:00"}, {"cmtStatus": "噓", "cmtPoster": "cefywo", "cmtContent": ":就算2%都不高阿....何況不到2%", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":要讓我中就中阿反正基本死不了", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":整天搞東搞西到底是有完沒完,70億人2%也就1.4億人而已還", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":可以降低資源需求的壓力", "cmtDate": "2021-05-01 13:17:00"}]
## 3 [{"cmtStatus": "推", "cmtPoster": "hipmyhop", "cmtContent": ":台灣現在就是這樣啊對變種病毒太掉以輕心", "cmtDate": "2021-05-01 05:28:00"}, {"cmtStatus": "→", "cmtPoster": "hipmyhop", "cmtContent": ":祈禱五月最好沒事否則大家準備飛高高", "cmtDate": "2021-05-01 05:29:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":現在就是彈性疲乏除非一口氣死亡率提高到50%", "cmtDate": "2021-05-01 05:32:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":甚至每天都是死幾百幾千才有可能恐慌", "cmtDate": "2021-05-01 05:32:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":看資本市場的資金流通/民眾的應變就知道了", "cmtDate": "2021-05-01 05:33:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":單純是防到彈性疲乏畢竟死亡率沒有民眾認知來的可怕", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":今天又不是說100人很容易感染60+然後感染者又死亡率", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "Despairile", "cmtContent": ":輕易超越50%以上所以彈性疲乏也很正常", "cmtDate": "2021-05-01 05:34:00"}, {"cmtStatus": "→", "cmtPoster": "iampig951753", "cmtContent": ":人為財死如果這你還不懂就是沒有慧根", "cmtDate": "2021-05-01 05:40:00"}, {"cmtStatus": "噓", "cmtPoster": "cefywo", "cmtContent": ":本來就是死亡率又不高,整天綁手綁腳的不覺得CP值很低嗎.", "cmtDate": "2021-05-01 06:01:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":..", "cmtDate": "2021-05-01 06:01:00"}, {"cmtStatus": "→", "cmtPoster": "snow3804", "cmtContent": ":沒那摸倒楣", "cmtDate": "2021-05-01 06:14:00"}, {"cmtStatus": "→", "cmtPoster": "annielu", "cmtContent": ":噓某樓,死亡率不高你去得啊!有後遺症不要來又哭爸哭媽", "cmtDate": "2021-05-01 06:19:00"}, {"cmtStatus": "噓", "cmtPoster": "clerkhsiao", "cmtContent": ":住美國對日本這麼了解?", "cmtDate": "2021-05-01 06:49:00"}, {"cmtStatus": "→", "cmtPoster": "moonshade", "cmtContent": ":我都看日本新聞練習日文", "cmtDate": "2021-05-01 06:52:00"}, {"cmtStatus": "推", "cmtPoster": "ifyoutry", "cmtContent": ":數學不錯喔還知道死亡率不高,以2%死亡率來計算假設台灣", "cmtDate": "2021-05-01 07:27:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":都一年多了怎麼還有人覺得死亡率不高所以沒關係這個病", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":的特點在於傳染力極強一旦爆發會瞬間癱瘓醫療系統到", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "yjchiou", "cmtContent": ":時連得其他病的人也會死掉", "cmtDate": "2021-05-01 07:28:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":人有八成的人得到那大概會死三十七萬人,大概就噴掉一個", "cmtDate": "2021-05-01 07:29:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":基隆市的人口而已.", "cmtDate": "2021-05-01 07:31:00"}, {"cmtStatus": "→", "cmtPoster": "ifyoutry", "cmtContent": ":當然講八成是危言聳聽啦,但即使取十分之一也得死三萬多", "cmtDate": "2021-05-01 07:35:00"}, {"cmtStatus": "噓", "cmtPoster": "takasaki", "cmtContent": ":死亡率不到2%好嗎...", "cmtDate": "2021-05-01 08:41:00"}, {"cmtStatus": "噓", "cmtPoster": "cefywo", "cmtContent": ":就算2%都不高阿....何況不到2%", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":要讓我中就中阿反正基本死不了", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":整天搞東搞西到底是有完沒完,70億人2%也就1.4億人而已還", "cmtDate": "2021-05-01 13:17:00"}, {"cmtStatus": "→", "cmtPoster": "cefywo", "cmtContent": ":可以降低資源需求的壓力", "cmtDate": "2021-05-01 13:17:00"}]
## e_ip insertedDate dataSource cmtPoster cmtStatus
## 1 24.4.155.116 2021-05-02 00:57:46 ptt takasaki 噓
## 2 24.4.155.116 2021-05-02 00:57:46 ptt hipmyhop 推
## 3 24.4.155.116 2021-05-02 00:57:46 ptt hipmyhop →
## cmtContent topic gamma
## 1 :死亡率不到2%好嗎... 2 0.9952704
## 2 :台灣現在就是這樣啊對變種病毒太掉以輕心 2 0.9952704
## 3 :祈禱五月最好沒事否則大家準備飛高高 2 0.9952704
取出 cmtPoster(回覆者)、artPoster(發文者)、artUrl(文章連結) 三個欄位
link <- posts_Reviews %>% select(cmtPoster, artPoster, artUrl)
head(link,3)## cmtPoster artPoster artUrl
## 1 takasaki moonshade https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
## 2 hipmyhop moonshade https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
## 3 hipmyhop moonshade https://www.ptt.cc/bbs/Gossiping/M.1619817726.A.4B2.html
reviewNetwork <- graph_from_data_frame(d=link, directed=T)
reviewNetwork## IGRAPH 3c8f6fa DN-- 23648 146792 --
## + attr: name (v/c), artUrl (e/c)
## + edges from 3c8f6fa (vertex names):
## [1] takasaki ->moonshade hipmyhop ->moonshade hipmyhop ->moonshade
## [4] Despairile ->moonshade Despairile ->moonshade Despairile ->moonshade
## [7] Despairile ->moonshade Despairile ->moonshade Despairile ->moonshade
## [10] iampig951753->moonshade cefywo ->moonshade cefywo ->moonshade
## [13] cefywo ->moonshade cefywo ->moonshade clerkhsiao ->moonshade
## [16] moonshade ->moonshade ifyoutry ->moonshade yjchiou ->moonshade
## [19] yjchiou ->moonshade yjchiou ->moonshade ifyoutry ->moonshade
## [22] ifyoutry ->moonshade ifyoutry ->moonshade cefywo ->moonshade
## + ... omitted several edges
直接畫的話,因為點沒有經過篩選,看起來會密密麻麻的 還需要經過一次資料篩選,有興趣可以跑跑下面的code
# 畫出網路圖(密集恐懼警告)
# plot(reviewNetwork)
# plot(reviewNetwork, vertex.size=2, edge.arrow.size=.2,vertex.label=NA)
資料篩選的方式:
# # 看一下留言數大概都多少(方便後面篩選)
# posts %>%
# # filter(commentNum<100) %>%
# ggplot(aes(x=commentNum)) + geom_histogram()依據發文數或回覆數篩選post和review
# 帳號發文篇數
post_count = posts %>%
group_by(artPoster) %>%
summarise(count = n()) %>%
arrange(desc(count))
post_count## # A tibble: 1,072 x 2
## artPoster count
## <chr> <int>
## 1 kcbill 10
## 2 ismail 9
## 3 Pietro 8
## 4 imprezasti 7
## 5 sister4949 6
## 6 CavendishJr 5
## 7 dahanjian 5
## 8 Emacs 5
## 9 GingFreecss 5
## 10 haehae311444 5
## # … with 1,062 more rows
# 帳號回覆總數
review_count = reviews %>%
group_by(cmtPoster) %>%
summarise(count = n()) %>%
arrange(desc(count))
review_count## # A tibble: 23,278 x 2
## cmtPoster count
## <chr> <int>
## 1 TheoEpstein 612
## 2 Pietro 405
## 3 MyDice 403
## 4 win8719 378
## 5 friedpig 366
## 6 yufion 352
## 7 greedypeople 342
## 8 quiet93 331
## 9 demitri 309
## 10 vicious666 271
## # … with 23,268 more rows
# 發文者
poster_select <- post_count %>% filter(count >= 2)
posts <- posts %>% filter(posts$artPoster %in% poster_select$artPoster)
# 回覆者
reviewer_select <- review_count %>% filter(count >= 20)
reviews <- reviews %>% filter(reviews$cmtPoster %in% reviewer_select$cmtPoster)# 檢視參與人數
length(unique(posts_Reviews$artPoster)) # 發文者數量 1062## [1] 1062
length(unique(posts_Reviews$cmtPoster)) # 回覆者數量 23276## [1] 23276
allPoster <- c(posts_Reviews$artPoster, posts_Reviews$cmtPoster) # 總參與人數 23648
length(unique(allPoster))## [1] 23648
標記所有出現過得使用者
userList <- data.frame(user=unique(allPoster)) %>%
mutate(type=ifelse(user%in%posts$artPoster, "poster", "replyer"))
head(userList,3)## user type
## 1 moonshade replyer
## 2 zhwang2123 replyer
## 3 kurt911 replyer
事件是6/4爆發的,我們挑出當天的文章和回覆看看
link <- posts_Reviews %>%
group_by(cmtPoster, artUrl) %>%
filter(n()>3) %>%
filter(artCatagory=="Gossiping") %>%
filter(artDate == as.Date('2021-06-04')) %>%
select(cmtPoster, artPoster, artUrl) %>%
unique()
link## # A tibble: 295 x 3
## # Groups: cmtPoster, artUrl [295]
## cmtPoster artPoster artUrl
## <chr> <chr> <chr>
## 1 aprilsheep KZS https://www.ptt.cc/bbs/Gossiping/M.1622736133.A.D66.html
## 2 coolscott jasondaio https://www.ptt.cc/bbs/Gossiping/M.1622736352.A.5DB.html
## 3 u831208 kiddingsa https://www.ptt.cc/bbs/Gossiping/M.1622736757.A.14B.html
## 4 LtoM722 mmind1 https://www.ptt.cc/bbs/Gossiping/M.1622739043.A.A89.html
## 5 trywish protoss https://www.ptt.cc/bbs/Gossiping/M.1622742440.A.F2C.html
## 6 magiclibra protoss https://www.ptt.cc/bbs/Gossiping/M.1622742440.A.F2C.html
## 7 willism protoss https://www.ptt.cc/bbs/Gossiping/M.1622742440.A.F2C.html
## 8 protoss protoss https://www.ptt.cc/bbs/Gossiping/M.1622742440.A.F2C.html
## 9 returnfox forfunu https://www.ptt.cc/bbs/Gossiping/M.1622742838.A.969.html
## 10 mirza forfunu https://www.ptt.cc/bbs/Gossiping/M.1622742838.A.969.html
## # … with 285 more rows
篩選在link裡面有出現的使用者
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
head(filtered_user,3)## user type
## 1 tysh710320 replyer
## 2 s9234032 replyer
## 3 Aotearoa replyer
這邊要篩選link中有出現的使用者,如果用沒篩過的userList(igraph中graph_from_data_frame的v參數吃的那個東西),圖上就會出現沒有在link裡面的nodes,圖片就會變得沒有意義
p.s.想要看會變怎麼樣的人可以跑下面的code
## 警告!有密集恐懼症的人請小心使用
# v = userList
# reviewNetwork <- graph_from_data_frame(d=link, v=userList, directed=T)
# plot(reviewNetwork, vertex.size=3, edge.arrow.size=0.3,vertex.label=NA)因爲圖片箭頭有點礙眼,所以這裏我們先把關係的方向性拿掉,減少圖片中的不必要的資訊 set.seed 因為igraph呈現的方向是隨機的
set.seed(487)
# v=filtered_user
reviewNetwork = degree(reviewNetwork) > 2
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
plot(reviewNetwork, vertex.size=3, edge.arrow.size=0.3,vertex.label=NA)用使用者的身份來區分點的顏色
set.seed(487)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
plot(reviewNetwork, vertex.size=3, edge.arrow.size=0.3,vertex.label=NA)可以稍微看出圖中的點(人)之間有一定的關聯,不過目前只有單純圖形我們無法分析其中的內容。
因此以下我們將資料集中的資訊加到我們的圖片中。
為點加上帳號名字,用degree篩選要顯示出的使用者,以免圖形被密密麻麻的文字覆蓋
filter_degree = 20
set.seed(123)
# 設定 node 的 label/ color
labels <- degree(reviewNetwork) # 算出每個點的degree
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
plot(
reviewNetwork,
vertex.size=5,
edge.width=5,
vertex.label.dist=3,
vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)我們可以看到基本的使用者關係,但是我們希望能夠將更進階的資訊視覺化。
例如:使用者經常參與的文章種類,或是使用者在該社群網路中是否受到歡迎。
挑選出2021-06-04當天的文章, 篩選一篇文章回覆3次以上者,且文章留言數多餘200則, 文章主題歸類為1(日本疫苗贈台)與3(民間團體捐贈)者, 欄位只取:cmtPoster(評論者), artPoster(發文者), artUrl(文章連結), topic(主題)
link <- posts_Reviews %>%
group_by(cmtPoster, artUrl) %>%
filter(n()>3) %>%
filter(artDate == as.Date('2021-06-04')) %>%
filter(topic == 1 | topic == 3) %>%
select(cmtPoster, artPoster, artUrl, topic) %>%
unique()
link## # A tibble: 280 x 4
## # Groups: cmtPoster, artUrl [280]
## cmtPoster artPoster artUrl topic
## <chr> <chr> <chr> <int>
## 1 aprilsheep KZS https://www.ptt.cc/bbs/Gossiping/M.1622736133.A.… 1
## 2 coolscott jasondaio https://www.ptt.cc/bbs/Gossiping/M.1622736352.A.… 3
## 3 alan0204 AllenHuang https://www.ptt.cc/bbs/Gossiping/M.1622769664.A.… 1
## 4 cliffcliff AllenHuang https://www.ptt.cc/bbs/Gossiping/M.1622769664.A.… 1
## 5 RELIFE168 AllenHuang https://www.ptt.cc/bbs/Gossiping/M.1622769664.A.… 1
## 6 tw54585 Allen320 https://www.ptt.cc/bbs/Gossiping/M.1622773890.A.… 3
## 7 chenlimath Allen320 https://www.ptt.cc/bbs/Gossiping/M.1622773890.A.… 3
## 8 Tiphareth Allen320 https://www.ptt.cc/bbs/Gossiping/M.1622773890.A.… 3
## 9 farnorth Allen320 https://www.ptt.cc/bbs/Gossiping/M.1622773890.A.… 3
## 10 birdy590 Allen320 https://www.ptt.cc/bbs/Gossiping/M.1622773890.A.… 3
## # … with 270 more rows
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
head(filtered_user,3)## user type
## 1 s9234032 replyer
## 2 yeh67 replyer
## 3 tw54585 replyer
filter_degree = 8
# 建立網路關係
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
# 依據使用者身份對點進行上色
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
# 依據回覆發生的文章所對應的主題,對他們的關聯線進行上色
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$topic == "2", "palevioletred", "lightgreen")
# 畫出社群網路圖(degree>7的才畫出來)
set.seed(123)
plot(reviewNetwork, vertex.size=3, edge.width=3, vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)
# 加入標示
legend("bottomright", c("發文者","回文者"), pch=21,
col="#777777", pt.bg=c("gold","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("日本疫苗贈台","民間團體捐贈"),
col=c("palevioletred", "lightgreen"), lty=1, cex=1)
PTT的回覆有三種,推文、噓文、箭頭,我們只要看推噓就好,因此把箭頭清掉,這樣資料筆數較少,所以我們把篩選的條件放寬一些。
filter_degree = 20 # 使用者degree
# 過濾留言者對發文者的推噓程度
link <- posts_Reviews %>%
filter(cmtStatus!="→") %>%
group_by(cmtPoster, artUrl) %>%
filter( n() > 2) %>%
ungroup() %>%
select(cmtPoster, artPoster, artUrl, cmtStatus) %>%
unique()
# 接下來把網路圖畫出來,跟前面做的事都一樣,因此不再細述
# 篩選link中有出現的使用者
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
# 建立網路關係
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
# 依據使用者身份對點進行上色
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
# 依據回覆發生的文章所對應的主題,對他們的關聯線進行上色
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$cmtStatus == "推", "lightgreen", "palevioletred")
# 畫出社群網路圖
set.seed(5432)
plot(reviewNetwork, vertex.size=2, edge.width=3, vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)
# 加入標示
legend("bottomright", c("發文者","回文者"), pch=21,
col="#777777", pt.bg=c("gold","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("推","噓"),
col=c("lightgreen","palevioletred"), lty=1, cex=1)可以發現本次的討論中幾乎都是推文、噓文較少
需要設定每個節點的id,記得要從0開始
library(networkD3)
links = link
nodes = filtered_user
nodes$id = 0:(length(nodes$user) - 1)
# 整理資料格式
nodes_complete <- data.frame(nodeID = unique(c(links$cmtPoster, links$artPoster)))
nodes_complete$group <- nodes$type[match(nodes_complete$nodeID, nodes$user)]
links$source <- match(links$cmtPoster, nodes_complete$nodeID) - 1
links$target <- match(links$artPoster, nodes_complete$nodeID) - 1
# 畫圖
library(networkD3)
forceNetwork(Links = links, Nodes = nodes_complete, Source = "source",
Target = "target", NodeID = "nodeID", Group = "group",
opacity = 0.8, fontSize = 10, zoom = TRUE,legend = TRUE, opacityNoHover = TRUE,
colourScale = "d3.scaleOrdinal(d3.schemeCategory10);",
linkColour = ifelse(links$cmtStatus == "推", "palegreen","lightcoral") # 設定推噓顏色
)## Links is a tbl_df. Converting to a plain data frame.
贈送疫苗的討論重點有哪些? 主要分為哪幾種風向? 對於2021-05-01 ~ 2021-06-14收集的文章,大概可以分成討論日本疫苗提供給台灣的新聞、民間團體捐贈的相關報導相關這兩種,其他還有著重討論民進黨與疫苗的相關報導或和美國可能捐贈的相關報導的討論等四種。主題一和三討論重點多在於「日本疫苗提供原因、數量」、「國產」、「民間團體捐贈」等流程的探討。
目前風向最偏哪邊? 客觀討論計算方式的文章不少,但嘲諷、八卦性質的文章也有。
討論贈送疫苗的社群網路如何分布? 以社群文章數來看,報導相關新聞較多,但從社群網路觀察發現,日本贈送疫苗新聞正面貼文討論聲量較高,民間贈送疫苗新聞則負面聲量較高。推論民眾對於接受國外政府贈送的接受度高。
贈送疫苗的意見領袖有誰?網友的推噓狀態如何? 因為資料選取的時間較短,只要幾篇回覆量高的貼文,就有機會成為社群中心,在八卦版上,以報導討論為主的意見領袖及回覆推噓皆有,調侃批評部分也有,網友大多正面推文。