資料集合

資料主要從 ptt 高雄板
尋找關鍵字「公車」、「捷運」和「輕軌」
時間區段從 2014-01-01 至 2018-12-31

require(data.table)
require(dplyr)
require(readr)
require(dplyr)
require(stringr)
require(jiebaR)
require(tidytext)
require(NLP)
require(tidyr)
require(ggplot2)
require(ggraph)
require(igraph)
require(scales)
require(reshape2)
require(widyr)
require(knitr)
library(kableExtra)
library(data.table)
require(ropencc)
require(wordcloud2)
require(lubridate)
require(htmlwidgets)
require(webshot)
require(plotly)
require( RColorBrewer)

require(servr)
require(tm)
require(data.table)
require(stringr)
library(topicmodels)
require(LDAvis)
require(webshot)

## [1] "Chinese (Traditional)_Taiwan.950"

動機與分析目的

背景與動機
- 如果城市裡的公共交通運輸系統帶來的是便利，對人民來說，確實是一大幸事。我們能從新聞或報章雜誌裡了解政府對於公共運輸的政策與方案，但對於民眾對於該議題的想法，則沒有一個特定的平台進行統整。因此我們主要針對民眾在PTT討論版上，對於公車、輕軌、捷運進行討論分析。
研究目的
1. 了解民眾對於公共載具的評價
2. 大眾對於哪些路線及站點具有較高討論度
3. 對於官方議題民眾的想法

資料讀取

主要將 ptt 檔案分成「文章」和「留言」
輸出前 10 筆資料的形式

Sys.setlocale("LC_CTYPE", "cht")

## [1] "Chinese (Traditional)_Taiwan.950"

commnet_A=fread("comment_Kaohsiung.csv",encoding = "UTF-8") %>% filter(find_word!="bike");
noncommnet_A=fread("non_comment_Kaohsiung.csv",encoding = "UTF-8")%>% filter(find_word!="bike");

kable(noncommnet_A %>% head()%>% select(-ip,-imp_word_count,-route_imp_word_count)) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "500px")

url	find_word	text	author	title	time	board_content	board_content_type	comment_num
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html	公車	from 一卡通 ipass粉絲團一卡通免費搭市區公車延長到6月底高雄板非心情板、個板，無關高雄廣泛市民之文章請勿張貼。這裡是高雄板，請討論與多數高雄市民有直接關聯性的文章。修改文章標題請於文章列表按大寫T 同性質文章請以大E修改內文方式加註，請勿於板上重覆多發 ◆ From:	scottlu28 (D)	[閒聊] 一卡通免費搭公車延長到9月底	2014-02-25T05:47:15Z	Kaohsiung	閒聊	34
https://www.ptt.cc/bbs/Kaohsiung/M.1393928042.A.3B5.html	公車	網誌好讀版：前鎮分館前言：個月推出一卡通搭市區公車免費。本以為這活動與我無關，後來看到有人認真分享搭免費公車通勤的心得，才開始思考能用這資源做些什麼。到那幾條文化公車，偏偏文化公車不在免費範圍，要錢哪！想了幾天，思路不知怎麼接上的，突然冒出了「圖書館」。公車應該都到得了。再者自幾年前三民分館、寶珠分館空間改造變得很好用之後，對其他分館也有了些好奇心，現在有免費公車，似乎是個好機會？去看看似乎無妨？逛上一次，來個「搭免費公車逛圖書館」！動機和可行性皆備，時間就訂在日檢後出發！ Day 1 102年12月3日（二）晴行程：前鎮分館→小港分館→林園二館→林園分館第一站：前鎮分館交通：────＞〔一心一路口〕───＞〔公正路口〕───＞前鎮分館得先走上一小段才能轉乘，於是眼睜睜地看著對向、也就是我要搭的83公車離開！用語音查詢得知下一班要再等15分鐘，又想到下一站[瑞祥高中]有公共腳踏車，便　　抵達瑞祥高中找到公共腳踏車，離83公車到來還有一小段餘裕，想想還是搭公車吧！便往最近的站牌走去。豈知，就在走到剩沒幾公尺、都已經能看見站牌了，　　所以實際上是：　　　 168東　　　　　　　　　走　　　　　　　　　公共腳踏車　　 ────＞〔一心一路口〕───＞〔瑞祥高中〕───────＞前鎮分館雜記：　　這是整個計劃的第一站！圖書館旅行就從這裡開始！樓梯往上走到閱覽室。走進圖書館，映入眼睛的是一艘獨木舟，頓時還以為跑到那瑪夏或桃源分館去了！明明剛剛是騎公共腳踏車來的呀！物和館藏特色「原住民資料」，有很多沒想過的原住民書籍，有探討研究的、也有輕鬆入門的，比如原住民神話繪本，整整列了兩排書櫃。能看到，很容易搜尋。同為館藏特色的投資理財則倚著牆，不過因為這類書在各分館的量都不少，就不另外借書，只從「原住民資料」區借了一本「巴斯達隘傳說」。照片不多。接著在大樓外頭請路人幫忙拍了一張到此一遊，不小心聊太久，還腳踏車扣了10元。第一站就花到錢了！說好的不花錢呢……？ – 開放時間：週二至週六　09:00-21:00；週日　09:00-17:00 總坪數：599坪座位數：450席位服務電話：07-7173008、07-7173009 [email protected] 傳真電話：07-7174440 – ◆ From: 謝謝各位。目前輕旅行已經告一段落，會慢慢把遊記整理出來。	wildphoenix (小不點)	[遊記] 搭免費公車遊圖書館：前言&前鎮分館	2014-03-04T10:13:59Z	Kaohsiung	遊記	4
https://www.ptt.cc/bbs/Kaohsiung/M.1394096518.A.EA9.html	公車	網誌好讀版 Day 1 102年12月3日（二）晴　　　公共腳踏車　　　　　　　　　69　　　　　　　　公共腳踏車　　回到瑞祥高中還公共腳踏車，一台預定搭的紅12從旁邊跑掉，索性直接騎到捷運凱旋站再還。還車時耍笨，沒注意到袋子的提把勾到腳踏車就上架，只好再借再借還車並順利衝到前方幾公尺的站牌呢？　　就在完成還車程序時，綠燈也亮起，不加思索拔腿狂奔，眼看公車啟動、即將在我前方停下打開車門！原來是好心的公車司機要讓我上車！太感謝了。　　往小港的路上人不多，廟很多，有一小段路就有三間廟，而且規模不小。到小，等走回餐旅學院已經是半小時後了…… 　　小港分館實在太有特色！不只館內東逛西逛，館外也好好地繞上一圈，明明從館外繞不覺得大，進到裡頭又不覺得小。館舍左方有停車場和一小塊庭園，外觀有　　進到裡頭徵求拍照許可才知道原來小港分館是得過獎的綠建築，常有人來此拍照，難怪申請表格早早就備好，填完表格也有攝影證可佩戴。興奮之下到處亂拍，　　從樓梯走上二樓即可看到牆上掛著天文資料，看來是館藏特色「天文」的一部份；同一樓另外有個由企業認養的空間，推廣環保和介紹鳳山丘陵及生態，第一次　　館內空間充足，不過藏書還不多，書架沒有擺滿，空書架也不少，要再過幾年才會比較有書的味道吧！可是有很棒的採光（雖然照片看不出來 Orz），每樓都有感覺心都靜了下來。　　整棟下來逛下來似乎沒看見電梯？正要詢問館員時瞥見了。原來在角落，向外有待加強啊！　　小港分館實在太有特色！一不小心就多待了，導致還腳踏車時又被扣了30元， – 開放時間：週二至週六09:00-21:00；週日09:00-17:00(99/11/12日開館) 休館日期：週一及國定假日休館座位數：閱覽席次218位館藏量：5萬多冊傳真電話：801-2226 地　　址：小港區博學路365號 –	wildphoenix (小不點)	[遊記] 搭免費公車遊圖書館：小港分館	2014-03-06T09:01:55Z	Kaohsiung	遊記	2
https://www.ptt.cc/bbs/Kaohsiung/M.1394442060.A.13D.html	公車	網頁好讀版 Day 1 102年12月3日（二）晴　　　───＞〔林園區公所〕──＞林園分館，所以沒撲空。　　這站有友人加入，但因為我的車較晚到，結果還是各逛各的…… 　　詢問拍照許可時不小心引出主任，說明原因後介紹了一本「小地方大書房：98 師在改造空間時的想法。有機會要去找來看看寶珠分館和陽明分館的介紹。　　同樣地，林園分館分別在入口和廁所前方展示了書籍整理和館舍改造的照片故　　像是友善的行動動線（入口坡道、電梯、樓梯扶手、粗糙不滑的梯面、反光條）、二樓的社區文化展示，以及大大的兒童閱覽區。聽說地下室還有？可惜不開放　　新書展示區和開卷主題合在一起，整個陳列區很大，一進門一定會被吸引過來佇足，找找有無喜歡的書。我也在這找到「天生就會跑」，現在還不是借的時候，　　二樓書庫分左右兩間，都不大，但wifi訊號很強很好用。三樓是自修室和館藏特色「紅樹林」區，不過一般借書只會爬到二樓吧！會到三樓多半是去自修室，感不定。　　逛完林園分館才和友人會合，解決晚餐，並在18:44搭末班橘9公車回家，一邊可惜來不及到大寮分館，順便發現橘9的路線可到4間分館，好像可以來研究一下行程幾乎可以決定了！回程經過陽明分館，可惜時間太晚、已過開放時間，只好留到下次。檢討：　　第一天行程的事先規劃做得不夠，只查了搭哪些車、沒查時刻表便匆匆上路，真的如預定再去大寮分館，肯定會困在大寮、沒公車回家。　　也因為太隨性規劃不足，交通上多花了不少時間，一站一站累積下來，回到家　　 – 林園分館休館日期：週一、國定假日總坪數：1182.31平方公尺館藏量：約4萬5千多冊服務電話：07-6431419 [email protected] 地　　址：高雄市林園區林園北路236號 ◆ From:	wildphoenix (小不點)	[遊記] 搭免費公車遊圖書館：林園分館	2014-03-10T09:00:57Z	Kaohsiung	遊記	2
https://www.ptt.cc/bbs/Kaohsiung/M.1394615924.A.0EE.html	公車	網頁好讀版 Day 2 102年12月6日　　　　紅33B　　　　　　　　橘7B　　　　　　　　　　走　　　　7:18　　　　　　　　　7:51　　　約9:08到　　搭橘7B公車很好玩，乘客和司機似乎彼此是熟面孔，可以聽到他們話家常，聊前一天末班車載幾個人啦、某個學校的老師也搭末班車或大樹祈福線啦、某校　　於是我大膽向司機請教到大樹果菜市場是否要在溪埔派出所站下車，結果司機和乘客都表示不知道，只知道姑山倉庫、姑婆寮，都不是我要去的。只好繼續讓我在大樹區公所問清楚再繼續往前開！怎麼這麼親切好心啊！　　順帶一提，中途在只有去程才會開過去的和山寺有短暫的休息時間，聽到司雜記：有心理準備待上一整天直到閉館；如果不想整天被困在同一處，就要有靠雙腳走五公里進入橘7A路線的覺悟。因為二個覺悟都沒有，曾經一度考慮放棄大樹三館　　根據站站時刻表，公車從溪埔派出所到終點佛光山，再從那邊回程到溪埔派出所有短短的約14分鐘的空檔，辦得到！抓緊時間下車後快速問路、衝到圖書館　　只是時間真的很趕，來不及借書拍到此一遊照，館內也未能多加留意，可惜，只好留待下次了。不過要是這空檔無法增加，恐怕再來幾次也還是無法靜下心　　到的時候其實滿早的，才剛開館，館員阿姨正在夾報紙，似乎有被我嚇到，如果不是那麼趕，有記得跟她道早安就好了。大樹三館開放時間：週三至週日09:00-17:00 休館日期：週一、二、國定假日座位數：42 位館藏量：約9千多冊傳真電話：6564272 地　　址：84049大樹區溪埔里溪埔路3巷60-1號2樓 –	wildphoenix (小不點)	[遊記] 搭免費公車遊圖書館：大樹三館	2014-03-12T09:18:42Z	Kaohsiung	遊記	4
https://www.ptt.cc/bbs/Kaohsiung/M.1394803517.A.1A9.html	公車	網頁好讀版 Day 2 102年12月6日（五）晴鳳山曹公分館　　　　橘7B　　　　　　　　　走　　　　9:22 究站牌，試圖找出公路客運的替代方式，不過……全部都沒車是怎麼一回事？所以到大樹分館也比預定的晚，原先預估的停留時間不夠用。樹三館和果菜市場的位置，聽司機說果菜市場那邊平常經過都關著，只有鳳荔節左右才開，難怪司機就不知道那邊便是我要去的果菜市場和圖書館，大概只有在地人　　這位司機真的很熱心！去程問路雖然不知道，也沒把我丟著不管，回程又載到我也關心了一下。而且還記得我一開始是從長庚上車的，雖然據說是會搭這路的乘雜記：聊了一下，原來剛開始實施也被抱怨過，不過久了習慣後，發現脫鞋入館可以隨時席地閱讀也不怕髒，不只是兒童閱覽區，是整館都能以地為席。若是假日就能親眼　　聽說大樹分館原本在山上，沒什麼人願意特地上山借書，搬到現址後方便得多，讀者也多起來了。二樓自修室也很特別，有一方陽台可提供給不喜歡冷氣的人使我也心動想在這個陽台讀一次書回味回味。當然，當年的考試壓力就免了！大樹分館開放時間：週二至週六09:00-21:00週日9:00-17:00 休館日期：週一、國定假日座位數：60位館藏量：約2萬多冊傳真電話：07-6517634 地　　址：84047大樹區中興西路50號 – 還好橘7在附近有站，還是到得了。耍笨了，謝謝提醒。免費公車還有三個多月，可以先從高雄的大學開始。	wildphoenix (小不點)	[遊記] 搭免費公車遊圖書館：大樹分館	2014-03-14T13:25:14Z	Kaohsiung	遊記	4

kable(commnet_A %>% head() %>% select(-ip,-author,-time,-comment_ip)) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "500px")

url	find_word	commet_num	text	comment_type	title	board_content	board_content_type	comment_author	comment_time
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html	公車	69025	這樣不會狂虧損嗎? 雖然開心免費這件事	推	[閒聊] 一卡通免費搭公車延長到9月底	Kaohsiung	閒聊	Alex1103	2014-02-25T08:03:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html	公車	69026	到底是要免費多久啊=口=	→	[閒聊] 一卡通免費搭公車延長到9月底	Kaohsiung	閒聊	blankhole	2014-02-25T07:13:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html	公車	69027	推阿~~希望公車能再進步些	推	[閒聊] 一卡通免費搭公車延長到9月底	Kaohsiung	閒聊	cy2013	2014-02-25T12:19:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html	公車	69028	讚!	推	[閒聊] 一卡通免費搭公車延長到9月底	Kaohsiung	閒聊	daisukidayo	2014-02-26T10:59:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html	公車	69029	最近看到越來越多人拿一卡通道OK小額消費了_好現象	推	[閒聊] 一卡通免費搭公車延長到9月底	Kaohsiung	閒聊	dreamcoast	2014-02-25T15:47:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html	公車	69030	大推! 繼續搭公車去旅行~	推	[閒聊] 一卡通免費搭公車延長到9月底	Kaohsiung	閒聊	familia	2014-03-01T01:55:00Z

字典讀取

讀取以下詞典
- 停用詞典(https://github.com/yinzm/ChineseStopWords/blob/master/ChineseStopWords.txt)
- 公車站點詞典
- 公車路線詞典 ( https://reurl.cc/lamGv wikipeida 爬文 )
- 捷運與輕軌站點名稱和代碼詞典 ( https://reurl.cc/vKY9A wikipeida 爬文 )
- 高雄路名詞典 ( https://reurl.cc/bNQZ6 wikipeida 爬文 )
- 自新增詞典 ( 使用 bigram 增加新字詞 )

stop_words_org <- readLines("./dict/stop_words.txt", 
                   encoding='utf-8')

## Warning in readLines("./dict/stop_words.txt", encoding = "utf-8"): 於 './
## dict/stop_words.txt' 找到不完整的最後一列

Encoding(stop_words_org)<-"UTF-8"
new_stop=c("板規","公車","捷運","輕軌","高雄市","高雄","更新","知道","謝謝","刪除","不過","目前","如果",
           "指出","文章","發現","請問","真的","請於","這種","內文","大寫","這裡","列表","標題","只有","其實",
           "有人","修改文章","xd")
stop_words=c(stop_words_org,new_stop)


station_name=readLines("./Station_data/station_name.txt")
route_name_org=readLines("./Station_data/route_name.txt")
route_name=ifelse(!str_detect(route_name_org,"[^0-9]"),paste0(route_name_org,"路"),route_name_org)

mrt_all_name=readLines("./Station_data/mrt_station_name.txt")
mrt_all_code=readLines("./Station_data/mrt_station_code.txt")

place_name=readLines("./Station_data/place_name.txt")
road_name=readLines("./Station_data/road_name.txt")

station_route_road=c(station_name,route_name,mrt_all_name,mrt_all_code,place_name,road_name)
#station_route_road=c(station_name,route_name,mrt_all_name,mrt_all_code,place_name,road_name)

my_word=readLines("./Station_data/my_dict.txt")


T1=rbind(cbind(station_name,"站點名稱"),
      cbind(route_name,"路線名稱"),
      cbind(c(mrt_all_name,mrt_all_code),"捷運相關"),
      cbind(road_name,"高雄路名稱")) %>% as.data.frame()
colnames(T1)=c("word","type")

T1=T1 %>% mutate(word=iconv(word,"big5","UTF-8"),type=iconv(type,"big5","UTF-8"))
T1 %>% DT::datatable()

# %>%
#   kable_styling(bootstrap_options = "striped", full_width = F) %>%
#   scroll_box(height = "500px",width="300px")

情緒詞典介紹

使用 CVAW3 做情緒字典，並從中選取與交通相關情緒字
CVAW3 與 LIWC 比較的好處如下
- 「正負面情緒」 1 到9 ( 為方便處理轉換成 -4 到 4 )
- 「亢奮情緒」 1 到9 ( 為方便處理轉換成 -4 到 4 )

my_sentiment=readLines("./dict/my_sentiment_word.txt")

#情緒字眼
CVAW3_org=read.csv("./dict/CVAW3.csv",header=T,stringsAsFactors = FALSE)
CVAW3_org=CVAW3_org %>% mutate(Word=iconv(Word,"big5","UTF-8"))

plot_ly(data=CVAW3_org,x=~Valence_Mean,y=~Arousal_Mean,text = ~Word,
        marker = list(size = 7,
                       color = 'rgba(182, 192, 255, .9)',
                       line = list(color = 'rgba(0, 0, 152, .8)',
                                   width = 2))) %>%
  layout(
    title = "CVAW3",
    xaxis = list(title = "情緒值"),
    yaxis = list(title = "亢奮值")
    )

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

CVAW3=CVAW3_org %>% mutate(Valence_Mean=Valence_Mean-5,Arousal_Mean=Arousal_Mean-5) %>% 
  filter(Word%in%my_sentiment)



sentiment_delete_word=c("優惠","免費","才能","凱旋","很多")

CVAW3=CVAW3 %>% filter(!(Word%in%sentiment_delete_word)) 
sentiment_word=CVAW3$Word

all_word=c(station_name,route_name,mrt_all_name,mrt_all_code,place_name,road_name,my_word,sentiment_word)

資料前處理

將以下的形式進行轉換
- xx:xx $\rightarrow$ [時間]
- xx (分|分鐘|秒) $\rightarrow$ [時段]
- xx.xx (公里|公尺|km|m) $\rightarrow$ [距離]
- 英文大寫 $\rightarrow$ 英文小寫
- xx.xx (元|$) $\rightarrow$ 錢
- - 符號刪除
- xx (公車|號) $\rightarrow$ xx 路
列出出現 [時段] 的 5 個文章

time_change_format<-function(text_tmp){
  text_tmp=gsub("[0-9]{1,2}[:：][0-9]{2}","[時間]",text_tmp);
  text_tmp=gsub("[0-9]{1,2}(分鐘|分|秒)","[時段]",text_tmp);
  text_tmp=gsub("[0-9]{1,}[\\.]?[0-9]{1,}[ ]?(公里|公尺|km|m)","[距離]",text_tmp);
  #text_tmp=gsub("\\$[0-9]{1,}","錢alpha",text_tmp);
  text_tmp=tolower(text_tmp);
  text_tmp=gsub("[0-9\\.一二三四五六七八九億萬千百十k]{1,}(元|\\$)","[錢]",text_tmp);
  text_tmp=gsub("\\$[0-9\\.一二三四五六七八九億萬千百十k]{1,}","[錢]",text_tmp);
  text_tmp=gsub("\\-","",text_tmp);
  text_tmp=gsub(paste0("(",paste(route_name_org,collapse = "|"),")(公車|號)"),"\\1路",text_tmp);
  return(text_tmp)
}

commnet_A=commnet_A  %>%
  filter(str_detect(text, regex("[\u4e00-\u62ff\u6300-\u77ff\u7800-\u8cff\u8d00-\u9fff]"))) %>% 
  mutate(text=time_change_format(text))


noncommnet_A=noncommnet_A %>% mutate(text=gsub("\n|\\s+","",text)) %>% 
  mutate(text=time_change_format(text))

kable(noncommnet_A  %>% filter(str_detect(text,"\\[時段\\]")) %>% arrange(nchar(text)) %>% head()) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "500px")

url	find_word	text	ip	author	title	time	board_content	board_content_type	comment_num	imp_word_count	route_imp_word_count
https://www.ptt.cc/bbs/Kaohsiung/M.1420994736.A.597.html	捷運	如果搭捷運至小港站步行到小港醫院會很遠嗎?查過辜狗地圖顯示步行[時段]左右請問這一段路真的[時段]能走到嘛?謝謝	1.173.156.29	Lucialover (來日方長)	[問題] 捷運小港站至小港醫院走路會很遠嗎?	2015-01-11T16:45:33Z	Kaohsiung	問題	17	9	0
https://www.ptt.cc/bbs/Kaohsiung/M.1411517239.A.D41.html	公車	剛剛要搭這班車，[時間]分的，結果在凹子底沒有等到車，看牌子是預計[時間]分會到在想說是我自己錯過嗎??可是我[時段]前就在等車了		te040928 (Mika)	[問題] 301路公車	2014-09-24T00:07:16Z	Kaohsiung	問題	6	5	0
https://www.ptt.cc/bbs/Kaohsiung/M.1491981832.A.BFC.html	公車	我自己比較常搭的是民族幹線90路和168東西線這三條件是每班次間距大概[時段]左右比較頻繁的路線因為接下來是雨季搭公車還是比較安全一點謝謝	180.217.236.186	HIKONA (《開始游牧民族》)	[問題] 高雄班次比較多的公車	2017-04-12T07:23:49Z	Kaohsiung	問題	26	10	4
https://www.ptt.cc/bbs/Kaohsiung/M.1491456320.A.0BE.html	捷運	最近要去小港喝喜酒，地點離小港捷運站開車約時段，想說坐捷運到小港站，再租機車or搭小黃過去!請問此2種方式可行?		wagotow (^^)	[問題] 捷運小港站附近有出租機車or計程車排班?	2017-04-06T05:25:17Z	Kaohsiung	問題	3	2	0
https://www.ptt.cc/bbs/Kaohsiung/M.1506861752.A.C7D.html	捷運	捷運真的很慢臺鐵還比較快而開車又太久只要設下去左營南科間[時段]天天通勤不是問題(北竹都有人通勤了月票才7千)就看市府有沒有前瞻去推了這樣南高合作不是很好嗎		dolare (dolare~)	Re: [問題] 台積電不來，路竹捷運怎麼辦??	2017-10-01T12:42:30Z	Kaohsiung	問題	30	4	0
https://www.ptt.cc/bbs/Kaohsiung/M.1543131674.A.9A4.html	捷運	幹嘛蓋？不是說高雄一堆蚊子館只會借廁所嗎？新市長肯定要拆掉輕軌、捷運改建藝文中心成為廁所亞洲新灣區所有建設停工全面拼經濟在高雄政治[時段]請到台北抗爭討論市長政見叫崩潰？	223.138.6.24	TWNchurchill (金城武)	Re: [閒聊] 捷運輕軌還會蓋嗎	2018-11-25T07:41:12Z	Kaohsiung	閒聊	88	10	0

發文數量

每月的發文如下:三張圖主要依據搜索字進行討論，公車的圖表較為均勻，且為三者相比之下較討論次數較低，主要因為公車沒有特別議題。
捷運和輕軌的討論度較為相似，在2017年6月左右，對於輕軌第二階段是否要繼續興建及是否要加速新增捷運黃線。

noncommnet_A1=noncommnet_A %>% mutate(time=gsub("T|Z","",time))%>% 
  mutate(time=parse_date_time(time,"%Y-%m-%d %H-%M-%S"))%>% 
  mutate(month=floor_date(time,"month") %>% as.Date()) %>% 
  mutate(bus_route_ind=str_detect(text,paste0("(",paste(c(route_name),collapse = "|"),")")))

noncommnet_A1  %>% group_by(month,find_word) %>% summarise(count=n()) %>% 
  group_by(find_word) %>% mutate(max_count=max(count)) %>% 
ggplot(aes(x= month,y=count,fill=find_word)) +
geom_col(show.legend = FALSE) +
  scale_x_date(labels = date_format("%Y/%m/%d")) +
  facet_wrap(~find_word, ncol = 1, scales = "free_y")+
  geom_text(aes(label=ifelse(count==max_count,as.character(month),"")),fontface="bold",size=4)

文字雲

公車刪除停用詞文字雲

因公車較無探討議題，PTT內文主要是評論司機、班次誤點、一卡通優惠。

jieba_tokenizer = worker()
new_user_word(jieba_tokenizer, c(all_word,"[時間]","[時段]","[距離]","[錢]"))

## [1] TRUE

chi_tokenizer <- function(t) {
  lapply(t, function(x) {
    if(nchar(x)>1){ # 新增的地方
      tokens <- segment(x, jieba_tokenizer)
      # 去掉字串長度爲1的詞彙
      tokens <- tokens[nchar(tokens)>1]
      return(tokens)
    }
  })
}

A_place <- noncommnet_A1 %>% select(month,find_word,text)  %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>%
  #filter(!str_detect(word, regex("[a-zA-Z]"))) %>%
  count(find_word,word, sort = TRUE)  %>% 
  filter(!(word %in% stop_words)) 

A_place_filter=A_place%>% filter(word%in%station_route_road)


A_place %>% filter(find_word=="公車") %>% group_by(word)  %>% summarise(count=sum(n))%>% 
  top_n(100,count)  %>% 
  filter(str_detect(word,"[^0-9]")) %>% 
  wordcloud2() -> tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc1.png", delay=10)

捷運刪除停用詞文字雲

主要探討議題有兩個，一個為北捷與高捷運量比較。
另一個是捷運黃線的高架化(地下化)探討。

A_place %>% filter(find_word=="捷運") %>% group_by(word)  %>% summarise(count=sum(n)) %>%
  top_n(100,count) %>% 
  filter(str_detect(word,"[^0-9]")) %>% 
  wordcloud2() ->tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc2.png", delay=10)

輕軌刪除停用詞文字雲

輕軌於大順路的未來營運效益及興建交通的衝擊。

A_place %>% filter(find_word=="輕軌") %>% group_by(word)  %>% summarise(count=sum(n))%>% 
  top_n(100,count) %>% 
  filter(str_detect(word,"[^0-9]")) %>% 
  wordcloud2() -> tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc3.png", delay=10)

所有運具前 20 大出現字詞

A_place %>% group_by(find_word,word)  %>% summarise(count=sum(n))%>%
  group_by(find_word) %>% 
  top_n(20,count) %>% 
  ungroup %>% 
  mutate(word = reorder(word, count)) %>% 
  ggplot(aes(word, count, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()

公車(只保留路線、站點和路名)

100路、三多商圈較靠近市區，因此討論度較高。
中華幹線、168東則是跨越多個行政區因此也為民眾討論度也頗高。

A_place_filter  %>% filter(find_word=="公車") %>% group_by(word)  %>% summarise(count=sum(n)) %>% 
  wordcloud2() ->tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc4.png", delay=10)

捷運(只保留路線、站點和路名)

在捷運規畫中藍線與黃建交會於大順路口，因此大順路教會明顯。
2018年11月左右，行政院長賴清德今天表示「要讓高雄捷運延伸至台南」使得台南次數變高。
紅線為(岡山-小港) 、橘線為(大寮-西子灣)民眾描述捷運紅、橘線也可會使用岡山-小港、大寮-西子灣表示，因此此四個字使用頻率較高。

A_place_filter %>% filter(find_word=="捷運") %>% group_by(word)  %>% summarise(count=sum(n)) %>% 
  wordcloud2()->tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc5.png", delay=10)

輕軌(只保留路線、站點和路名)

輕軌站名:哈瑪星、夢時代、凱旋、籬內仔
輕軌經過大順路受到許多爭議。

A_place_filter%>% filter(find_word=="輕軌") %>% group_by(word)  %>% summarise(count=sum(n)) %>% 
  wordcloud2()->tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc6.png", delay=10)

所有運具前 20 大出現字詞(只保留路線、站點和路名)

A_place_filter %>% group_by(find_word,word)  %>% summarise(count=sum(n))%>%
  group_by(find_word) %>% 
  top_n(20,count) %>% 
  ungroup %>% 
  mutate(word = reorder(word, count)) %>% 
  ggplot(aes(word, count, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()

Tf-Idf 查看重要的字

將每一篇文章前 10 大 tf-idf 的字取出後統計字出現頻率
我們發現Tf-Idf 有抓出該主題的重要詞彙
- 公車重要字:司機、交通局、一卡通優惠免費措施、站牌
- 捷運重要字:黃線、高捷、運量、一卡通優惠、草衙、左營
- 輕軌重要字:大順路、試營運、二階、試乘、工程

Url_words <- noncommnet_A1 %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>%
  filter(!str_detect(word, regex("[a-zA-Z]"))) %>%
  count(find_word,url, word, sort = TRUE)

Url_words_tf_idf <- Url_words %>%
  bind_tf_idf(word, url, n) %>% 
  filter(!(word%in%stop_words))

# 選每篇文章，tf-idf最大的十個詞，
# 並查看每個詞被選中的次數
Url_words_tf_idf%>% 
  group_by(url,find_word) %>%
  top_n(10,wt=tf_idf) %>%
  arrange(desc(url)) %>%
  group_by(find_word) %>%
  count(word, sort=TRUE) %>%
  top_n(20)%>%
  ungroup %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()

## Selecting by n

情緒分析

讀取 negation word

negation_word_tmp=readLines("./dict/negation_word/negation_word.txt",encoding = "UTF-8")
negation_word_tmp2=strsplit(negation_word_tmp,split="\t") %>% lapply( function(x)x[1]) %>% unlist
negation_word_tmp2_ind=strsplit(negation_word_tmp,split="\t") %>% lapply( function(x)x[2]) %>% unlist
ccst = converter(S2T)
negation_word_tmp3=c(ccst[negation_word_tmp2[negation_word_tmp2_ind==0]][1:220],"不應該","不符")
negation_word_tmp3

##   [1] "非"       "別"       "不"       "沒"       "無"       "勿"      
##   [7] "別出"     "別打"     "別帶"     "別動"     "別發"     "別煩"    
##  [13] "別放"     "別管"     "別開"     "別看"     "別來"     "別理"    
##  [19] "別亂"     "別弄"     "別惹"     "別是"     "別說"     "別提"    
##  [25] "別學"     "並不"     "不必"     "不成"     "不出"     "不大"    
##  [31] "不當"     "不到"     "不得"     "不等"     "不屌"     "不定"    
##  [37] "不懂"     "不動"     "不對"     "不多"     "不放"     "不該"    
##  [43] "不敢"     "不幹"     "不高"     "不搞"     "不給"     "不夠"    
##  [49] "不管"     "不行"     "不好"     "不合"     "不和"     "不會"    
##  [55] "不及"     "不佳"     "不見"     "不講"     "不可"     "不肯"    
##  [61] "不快"     "不了"     "不理"     "不利"     "不良"     "不靈"    
##  [67] "不買"     "不滿"     "不明"     "不能"     "不清"     "不去"    
##  [73] "不全"     "不讓"     "不認"     "不容"     "不如"     "不勝"    
##  [79] "不是"     "不說"     "不算"     "不太"     "不通"     "不妥"    
##  [85] "不像"     "不信"     "不需"     "不許"     "不要"     "不應"    
##  [91] "不用"     "不願"     "不再"     "不在"     "不值"     "不止"    
##  [97] "不住"     "不準"     "不足"     "不做"     "從不"     "都不"    
## [103] "而不"     "還不"     "還沒"     "毫不"     "毫無"     "很不"    
## [109] "決不"     "絕不"     "沒錯"     "沒法"     "沒空"     "沒錢"    
## [115] "沒人"     "沒用"     "沒有"     "請別"     "無法"     "無關"    
## [121] "無效"     "無需"     "無用"     "永不"     "又不"     "又沒"    
## [127] "沒戲"     "阻止"     "制止"     "遠非"     "絕非"     "決非"    
## [133] "否認"     "否定"     "難以"     "短缺"     "缺場"     "缺數"    
## [139] "稀缺"     "缺少"     "缺貨"     "未能"     "未定"     "請勿"    
## [145] "勿食"     "很差"     "差勁"     "差評"     "不堪"     "不同"    
## [151] "不便"     "不知"     "不宜"     "不負"     "不顧"     "不想"    
## [157] "匱乏"     "缺乏"     "匿乏"     "饋乏"     "窘乏"     "並不是"  
## [163] "並沒有"   "不等於"   "不合理"   "不靠譜"   "不可能"   "不瞭解"  
## [169] "不確定"   "不需要"   "不要緊"   "不一樣"   "不用說"   "不爭氣"  
## [175] "不知道"   "達不到"   "得不到"   "過不去"   "看不到"   "看不見"  
## [181] "看不清"   "來不及"   "買不起"   "沒辦法"   "沒想到"   "沒有過"  
## [187] "沒有用"   "上不去"   "受不了"   "想不到"   "用不了"   "用不<U+7740>"
## [193] "有沒有"   "找不到"   "非必需"   "非行家"   "未生效"   "從未有"  
## [199] "勿觸摸"   "非常差"   "食慾差"   "用不上"   "不像話"   "不可以"  
## [205] "不足取"   "顧不得"   "不值得"   "碰不得"   "記不住"   "成不了"  
## [211] "不入流"   "不願意"   "容不得"   "不熟悉"   "不能夠"   "不可比"  
## [217] "瞧不上"   "抗不住"   "不怎麼樣" "寸步難行" "不應該"   "不符"

主文情緒分析

利用「情緒字典」計算主文中文情緒
給出頻率前 20 大的情緒詞

從下面的結果得出不同載具的情緒字眼有所不同

公車
- 方便: 高雄公車一直被討論方不方便
- 問題: 高雄公車搭乘人數較少，大家會討論問題點
- 麻煩: 高雄公車不夠便利，大家覺得搭乘很麻煩
捷運 +建設、發展: 捷運需要花費大量金額建設和發展 +方便: 同高雄公車 +問題: 通高雄公車
輕軌
- 建設、問題: 同上
- 改善、安全: 輕軌的安全度一直備受爭議
- 完成: 輕軌最近才完成

######## 單一斷詞
A_url_org <- noncommnet_A1 %>% select(url,month,find_word,text)  %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>% 
  count(url,month,find_word,word, sort = TRUE)  %>% 
  inner_join(CVAW3,by=c("word"="Word")) 


A_url_org %>% group_by(find_word,word) %>% 
  summarise(n=sum(n),total_Valence_Mean=sum(n*Valence_Mean),total_Arousal_Mean=sum(n*Arousal_Mean))%>% 
  mutate(size_Valence_Mean=(total_Valence_Mean>=0),size_Arousal_Mean=(total_Arousal_Mean>=0)) %>% 
  group_by(find_word,size_Valence_Mean) %>% 
  top_n(10,wt = abs(total_Valence_Mean)) %>% 
  ungroup %>% 
  mutate(word = reorder(word, total_Valence_Mean)) %>% 
  ggplot(aes(word, total_Valence_Mean, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()

利用「否定詞」修正 bigram 中文情緒
對不同的「否定詞」給出頻率前 15 大的情緒詞
目前給出出現頻率最大的兩個否定詞「不」和「沒」

# unnest_tokens 使用的bigram分詞函數
# Input: a character vector
# Output: a list of character vectors of the same length
jieba_bigram <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    bigram<- ngrams(tokens, 2)
    bigram <- lapply(bigram, paste, collapse = " ")
    unlist(bigram)
  })
}

A_url_org_bigram_dict <- noncommnet_A1 %>%
  unnest_tokens(bigram, text, token = jieba_bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ")%>%
  count(word1, word2, sort = TRUE) %>% 
  filter(word1%in%negation_word_tmp3 & word2%in%CVAW3$Word)

A_url_org_bigram <- noncommnet_A1 %>%
  unnest_tokens(bigram, text, token = jieba_bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ")%>%
  count(url,month,find_word,word1, word2, sort = TRUE) %>% 
  filter(word1%in%negation_word_tmp3 & word2%in%CVAW3$Word) %>% 
  left_join(CVAW3,by=c("word2"="Word"))

A_url_org_bigram2=A_url_org_bigram%>% 
  group_by(url) %>%
  summarise(negation_Valence_per_url=sum(n*Valence_Mean),negation_Arousal_per_url=sum(n*Arousal_Mean)) 


tmp1=A_url_org_bigram %>% group_by(find_word,word1) %>% summarise(word1_n=n()) %>% group_by(find_word) %>% 
  top_n(2,word1_n)

A_url_org_bigram %>% left_join(tmp1,by=c("find_word","word1")) %>% na.omit %>% 
  mutate(tmp_ind=paste(find_word,word1),total_Valence_Mean=-1*(n*Valence_Mean)) %>%
  group_by(tmp_ind) %>% 
  top_n(15,abs(total_Valence_Mean)) %>% 
  ungroup() %>% 
  mutate(word2 = reorder(word2, total_Valence_Mean)) %>% 
  ggplot(aes(word2, total_Valence_Mean, fill = tmp_ind)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~tmp_ind, nrow = 2, scales="free",dir="v") + 
  coord_flip()

單一斷詞情緒與 bigram 修正情緒結合
計算每一篇文章的情緒值和亢奮值
從下圖可得出
- 正面文章都會以較為理性的方式撰寫 ( 有可能誇太大容易引起反彈被攻擊 )
- 反面文章都比較激動

A_url_org2_tmp=A_url_org%>% 
  group_by(url,month,find_word) %>%
  summarise(Valence_per_url=sum(n*Valence_Mean),Arousal_per_url=sum(n*Arousal_Mean)) 

A_url_org2=A_url_org2_tmp %>% left_join(A_url_org_bigram2,by="url") %>% 
  mutate(Valence_per_url=Valence_per_url-2*ifelse(!is.na(negation_Valence_per_url),negation_Valence_per_url,0),
         Arousal_per_url=Arousal_per_url-2*ifelse(!is.na(negation_Valence_per_url),negation_Arousal_per_url,0))


tmp_A= noncommnet_A %>% select(url,title);
tmp_A2=A_url_org2 %>% left_join(tmp_A,by="url") %>% 
  mutate(hyper_link=paste0("<a href='",url,"'>","Go","</a>"));
tmp_A3=tmp_A2 %>% ungroup %>% filter(Arousal_per_url>=quantile(Arousal_per_url,0.995)|Arousal_per_url<=quantile(Arousal_per_url,0.005)|
                                       Valence_per_url>=quantile(Valence_per_url,0.995)|Valence_per_url<=quantile(Valence_per_url,0.005))

plot_ly(data=tmp_A2,x=~Valence_per_url,y=~Arousal_per_url,text = ~title,type = 'scatter',
           mode = 'markers') %>%
  add_trace(data=tmp_A3,
    x=~(Valence_per_url-1.25),
    y=~(Arousal_per_url-1.25),
    text=~hyper_link,
    mode = 'text',
    showlegend = F
  ) %>%
  layout(
    title = "ptt 每一篇文章",
    xaxis = list(title = "情緒值(越大越正面)"),
    yaxis = list(title = "亢奮值(越大越亢奮)")
    )

計算每月平均每篇文章的情緒值和亢奮值
雖情緒有變動，但較為亢奮的文章表示當月內文平均較激動，更值得注意
這裡較為激進的月份分別為
- 公車: 2017-05、2018-12
- 捷運: 2018-03、2018-12
- 輕軌: 2018-04、2018-10

A_url=A_url_org2%>% 
  group_by(month,find_word) %>% 
  summarise(Valence_url_Mean=mean(Valence_per_url),
            Arousal_url_Mean=mean(Arousal_per_url)
            )


all_month=seq(from=min(A_url$month),to=max(A_url$month),by=30) %>% floor_date("month") %>% unique()
all_month_2=all_month[seq(1,length(all_month),4)]

A_url  %>% gather(line_type,Value,-month,-find_word) %>%
  group_by(line_type,find_word) %>% mutate(max_Value=max(Value),min_Value=min(Value)) %>%
  ungroup %>% 
  ggplot()+
  geom_line(aes(x=month,y=Value,group=line_type,colour=line_type),size=1.5)+
  scale_x_date(labels = date_format("%Y/%m"),breaks = all_month_2)+
  geom_text(aes(month,Value,label=ifelse(Value==max_Value & line_type=="Valence_url_Mean",as.character(month),"")),fontface="bold",size=4,color="blue")+
  geom_text(aes(month,Value,label=ifelse(Value==min_Value & line_type=="Valence_url_Mean",as.character(month),"")),fontface="bold",size=4,color="red")+
  facet_wrap(~find_word, ncol = 1, scales = "free_y")+
  theme(axis.text.x = element_text(face="bold", color="#993333",  angle=45))

評論情緒分析

利用「情緒字典」計算評論中文情緒
給出頻率前 20 大的情緒詞
解釋同主文

commnet_A1=commnet_A %>% mutate(time=gsub("T|Z","",time))%>% 
  mutate(time=parse_date_time(time,"%Y-%m-%d %H-%M-%S"))%>% 
  mutate(month=floor_date(time,"month") %>% as.Date()) 

######## 單一斷詞
A_commet_url_org <- commnet_A1 %>% select(url,commet_num,month,find_word,text)  %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>% 
  count(url,commet_num,month,find_word,word, sort = TRUE)  %>% 
  inner_join(CVAW3,by=c("word"="Word")) 

A_commet_url_org %>% group_by(find_word,word) %>% 
  summarise(n=sum(n),total_Valence_Mean=sum(n*Valence_Mean),total_Arousal_Mean=sum(n*Arousal_Mean))%>% 
  mutate(size_Valence_Mean=(total_Valence_Mean>=0),size_Arousal_Mean=(total_Arousal_Mean>=0)) %>% 
  group_by(find_word,size_Valence_Mean) %>% 
  top_n(10,wt = abs(total_Valence_Mean)) %>% 
  ungroup %>% 
  mutate(word = reorder(word, total_Valence_Mean)) %>% 
  ggplot(aes(word, total_Valence_Mean, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()

利用「否定詞」修正 bigram 中文情緒

######## bigram
A_commet_url_org_bigram_dict <- commnet_A1 %>%
  unnest_tokens(bigram, text, token = jieba_bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ")%>%
  count(word1, word2, sort = TRUE) %>% 
  filter(word1%in%negation_word_tmp3 & word2%in%CVAW3$Word)

A_commet_url_org_bigram <- commnet_A1 %>%
  unnest_tokens(bigram, text, token = jieba_bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ")%>%
  count(url,commet_num,month,find_word,word1, word2, sort = TRUE) %>% 
  filter(word1%in%negation_word_tmp3 & word2%in%CVAW3$Word) %>% 
  left_join(CVAW3,by=c("word2"="Word"))

tmp2=A_commet_url_org_bigram %>% group_by(find_word,word1) %>% summarise(word1_n=n()) %>% group_by(find_word) %>% 
  top_n(2,word1_n)

A_commet_url_org_bigram %>% left_join(tmp2,by=c("find_word","word1")) %>% na.omit %>% 
  mutate(tmp_ind=paste(find_word,word1),total_Valence_Mean=-1*(n*Valence_Mean)) %>%
  group_by(tmp_ind) %>% 
  top_n(15,abs(total_Valence_Mean)) %>% 
  ungroup() %>% 
  mutate(word2 = reorder(word2, total_Valence_Mean)) %>% 
  ggplot(aes(word2, total_Valence_Mean, fill = tmp_ind)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~tmp_ind, nrow = 2, scales="free",dir="v") + 
  coord_flip()

單一斷詞情緒與 bigram 修正情緒結合
計算每一篇文章平均每一篇留言評論情緒值和亢奮值

A_commet_url_org_bigram2_percomment=A_commet_url_org_bigram%>% 
  group_by(url,commet_num) %>%
  summarise(negation_Valence_per_comment=sum(n*Valence_Mean),negation_Arousal_per_comment=sum(n*Arousal_Mean))


A_commet_url_org2_percomment=A_commet_url_org%>% 
  group_by(url,commet_num,month,find_word) %>%
  summarise(Valence_per_comment=sum(n*Valence_Mean),Arousal_per_comment=sum(n*Arousal_Mean)) %>% 
  left_join(A_commet_url_org_bigram2_percomment,by=c("url","commet_num")) %>% 
  mutate(Valence_per_comment=Valence_per_comment-2*ifelse(!is.na(negation_Valence_per_comment),negation_Valence_per_comment,0),
         Arousal_per_comment=Arousal_per_comment-2*ifelse(!is.na(negation_Arousal_per_comment),negation_Arousal_per_comment,0));

  
A_commet_url_org2=A_commet_url_org2_percomment %>% 
  group_by(url,month,find_word) %>%
  summarise(Mean_Comment_Valence_per_url=mean(Valence_per_comment),
            Mean_Comment_Arousal_per_url=mean(Arousal_per_comment),
            observation=n()) %>% filter(observation>=5)

只顯示主文和評論亢奮值較高的文章
可以觀察出文章內的內文都相對較激進
從大部分的文章，可以得出這些文章大部分確實是較為激進的文章和評論

A_url_comment_combine=A_url_org2 %>% ungroup %>% 
  inner_join(A_commet_url_org2,by=c("url","find_word")) %>% select(url,
                                                    Valence_per_url,Arousal_per_url,
                                                    Mean_Comment_Valence_per_url,
                                                    Mean_Comment_Arousal_per_url) %>% 
  filter(Arousal_per_url>=quantile(Arousal_per_url,0.7),
         Mean_Comment_Arousal_per_url>=quantile(Mean_Comment_Arousal_per_url,0.7)) %>% 
  mutate(hyper_link=paste0("<a href='",url,"'>","Go","</a>"));
  
tmp_A= noncommnet_A %>% select(url,title);
tmp_A2=A_url_comment_combine %>% left_join(tmp_A,by="url") %>% 
  mutate(hyper_link=paste0("<a href='",url,"'>","Go","</a>"));

plot_ly(data=tmp_A2,x=~Mean_Comment_Valence_per_url,y=~Valence_per_url,text = ~title,type = 'scatter',mode = 'markers') %>%
  add_trace(
    x=~(Mean_Comment_Valence_per_url-0.05),
    y=~(Valence_per_url-1.25),
    text=~hyper_link,
    mode = 'text',
    showlegend = F
  ) %>%
  layout(
    title = "ptt 每一篇文章",
    xaxis = list(title = "評論情緒值(越大越正面)"),
    yaxis = list(title = "正文情緒值(越大越正面)")
    )

計算每月平均每篇文章平均每篇評論的情緒值和亢奮值
解釋與主文同里

A_commet_url_org2=A_commet_url_org2_percomment %>% 
  group_by(url,month,find_word) %>%
  summarise(Mean_Comment_Valence_per_url=mean(Valence_per_comment),
            Mean_Comment_Arousal_per_url=mean(Arousal_per_comment),
            observation=n()) %>% filter(observation>=5)

A_commet_url=A_commet_url_org2%>% 
  group_by(month,find_word) %>% 
  summarise(Valence_url_comment_Mean=mean(Mean_Comment_Valence_per_url),
            Arousal_url_comment_Mean=mean(Mean_Comment_Arousal_per_url)
  )


all_month=seq(from=min(A_commet_url$month),to=max(A_commet_url$month),by=30) %>% floor_date("month") %>% unique()
all_month_2=all_month[seq(1,length(all_month),4)]


A_commet_url  %>% gather(line_type,Value,-month,-find_word) %>% 
  group_by(line_type,find_word) %>% mutate(max_Value=max(Value),min_Value=min(Value)) %>%
  ungroup %>% 
  #mutate(line_colour=ifelse(str_detect(line_type,"Valence"),"#CC6666", "#9999CC"))%>%
  ggplot()+
  geom_line(aes(x=month,y=Value,group=line_type,colour=line_type),size=1.5)+
  scale_x_date(labels = date_format("%Y/%m"),breaks = all_month_2)+
  geom_text(aes(month,Value,label=ifelse(Value==max_Value & line_type=="Valence_url_comment_Mean",as.character(month),"")),fontface="bold",size=4,color="blue")+
  geom_text(aes(month,Value,label=ifelse(Value==min_Value & line_type=="Valence_url_comment_Mean",as.character(month),"")),fontface="bold",size=4,color="red")+
  facet_wrap(~find_word, ncol = 1, scales = "free_y")+
  theme(axis.text.x = element_text(face="bold", color="#993333",  angle=45))

字詞 correlation 計算

捷運文章尋找與以下字詞前 10 大相關字詞
- 黃線: 較多文章或新聞都探討黃線是前瞻
- 運量: 捷運運量人數一直備受討論成本，與台北捷運差異
- 台鐵: 因為紅線延伸路線與台鐵路竹重疊較高

Url_words <- noncommnet_A1 %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>%
  count(url,find_word, word, sort = TRUE) %>% 
  filter(!(word%in%stop_words))%>% 
  filter(!str_detect(word,"^[0-9]{1,}\\.$"))


# 計算兩個詞彙間的相關性
word_cors=lapply(c("公車","捷運","輕軌"), function(tmp_find_word){
  word_cors_tmp <- Url_words %>% filter(find_word==tmp_find_word) %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  ungroup %>% 
  pairwise_cor(word, url, sort = TRUE)
  return(data.frame(find_word=tmp_find_word,word_cors_tmp))
}) %>% do.call(what="rbind")


word_cors %>% filter(find_word=="捷運") %>% 
  filter(item1 %in% c("黃線","運量","台鐵")) %>%
  group_by(item1) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~item1 , scales = "free") +
  coord_flip()

## Selecting by correlation

輕軌文章尋找與以下字詞前 10 大相關字詞
- 大順路: 討論停車格與建輕軌容易塞車影響車流
- 工程: 交代目前輕軌第一階段和第二階段工程
- 交通: 討論輕軌要高架和地下化架設與否

word_cors %>% filter(find_word=="輕軌") %>% 
  filter(item1 %in% c("大順路","工程","交通","[錢]")) %>%
  group_by(item1) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~item1 , scales = "free") +
  coord_flip()

## Selecting by correlation

使用網路圖呈現字詞相關性

公車字詞網絡
能看出討論公車免費或優惠的方案

seed_words=c("加註","板非","重覆多發","讀版","da","館藏","網頁","點定","要問","別忘了","記者")


threshold <- .65
remove_words <- word_cors %>% filter(find_word=="公車") %>%
                filter((item1 %in% seed_words|item2 %in% seed_words), correlation>threshold) %>%
                .$item1 %>%
                unique()
remove_words

##  [1] "點定"     "請參閱"   "無關"     "板非"     "重覆多發" "加註"    
##  [7] "關聯性"   "張貼"     "讀版"     "da"       "修改"     "性質"    
## [13] "館藏"     "要問"     "別忘了"   "附上"     "心情"     "網頁"    
## [19] "討論"     "高雄市民" "市民"     "報導"     "記者"     "開放"    
## [25] "方式"     "日期"

word_cors_new=word_cors %>% filter(find_word=="公車") %>% select(-find_word) %>% 
  filter((!(item1 %in% remove_words|item2 %in% remove_words))) 
word_cors_new %>%
  filter(correlation > .3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
  theme_void()

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

捷運字詞網絡
主要有討論黃線的建設、地下化、公車和成本等等
討論黃線的文章，大部分也會討論紅線延伸路竹與台鐵路竹重疊率高

word_cors_new=word_cors %>% filter(find_word=="捷運") %>% select(-find_word) %>% 
  filter((!(item1 %in% remove_words|item2 %in% remove_words))) 
word_cors_new %>%
  filter(correlation > .3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
  theme_void()

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

輕軌字詞網絡
輕軌主要討論建設和一些輕軌站點

word_cors_new=word_cors %>% filter(find_word=="輕軌") %>% select(-find_word) %>% 
  filter((!(item1 %in% remove_words|item2 %in% remove_words))) 
word_cors_new %>%
  filter(correlation > .3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
  theme_void()

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

文章-主題分析

捷運主體分析

因為公車本身主題不明顯，故主題分析只討論捷運和輕軌
捷運主體主要分成
- 高捷和北捷運量上的差異
- 黃線的規劃討論

ϕ matrix (topic * term)

#根據每一篇文章的Url給定一個id。
tmp_find_word="捷運"

reserved_word <- Url_words %>% 
  group_by(find_word,word) %>% 
  count() %>% 
  filter(n > 5) %>% ungroup %>% 
  filter(find_word==tmp_find_word) %>% .$word

Url_words_2 <- Url_words %>%
  mutate(artId = group_indices(., url)) %>% 
  filter(find_word==tmp_find_word,word%in%reserved_word)

Url_words_dtm <-Url_words_2 %>% cast_dtm(artId, word, n)
Url_words_lda <- LDA(Url_words_dtm, k =2, control = list(seed = 2000))


#從中可以得到特定主題生成特定詞彙的概率。
Url_words_topics <- tidy(Url_words_lda, matrix = "beta") # 注意，在tidy function裡面要使用"beta"來取出Phi矩陣。

Url_words_top_terms <- Url_words_topics %>%
  filter(! term %in% stop_words) %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


Url_words_top_terms %>% ungroup %>% 
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

θ matrix (document * topic)

#θ matrix (document * topic)
Url_words_documents <- tidy(Url_words_lda, matrix="gamma") # 在tidy function中使用參數"gamma"來取得 theta矩陣。
Url_words_documents

## # A tibble: 1,122 x 3
##    document topic    gamma
##    <chr>    <int>    <dbl>
##  1 592          1 0.999   
##  2 727          1 0.000509
##  3 759          1 0.111   
##  4 619          1 0.00170 
##  5 1368         1 0.000587
##  6 290          1 0.481   
##  7 886          1 0.00135 
##  8 260          1 0.999   
##  9 693          1 0.0857  
## 10 1149         1 0.117   
## # ... with 1,112 more rows

#gamma值代表的是這篇文章中有多少比例的詞是出自於特定topic

簡單的主題分類

從標題可以簡單看出確實和我們猜測的主題相符

Url_words_2$artId <- as.character(Url_words_2$artId)

Url_words_documents%>%
  group_by(topic) %>%
  top_n(10, wt=gamma) %>%
  inner_join(Url_words_2, by = c("document" = "artId")) %>%
  distinct(url) %>%
  inner_join(noncommnet_A, by =c("url"="url")) %>%
  select(topic, title) %>% DT::datatable()

按照θ矩陣的“gamma”值劃分兩種主題的資料

Url_words_documents_spread <- Url_words_documents %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, gamma)

half_num = round(nrow(Url_words_documents_spread)/2) # 原始文章數量的一半

topic1_id <- Url_words_documents_spread %>%   # 取出topic_1最高的half_num篇文章
  top_n(half_num, topic1) %>%
  select(document) %>%
  unlist()

topic2_id <- Url_words_documents_spread$document %>%
  setdiff(topic1_id)

Topic 1

#install.packages("wordcloud2")
require(wordcloud2)
word_cloud_topic_1 <- Url_words_2 %>%
  filter(artId %in% topic1_id) %>%
  group_by(word) %>%
  summarise(sum = sum(n)) %>%
  arrange(desc(sum)) %>%
  wordcloud2()

# saveWidget(word_cloud_topic_1, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc7.png", delay=10)

Topic 2

word_cloud_topic_2 <- Url_words_2 %>%
  filter(artId %in% topic2_id  ) %>%
  group_by(word) %>%
  summarise(sum = sum(n)) %>%
  arrange(desc(sum)) %>%
  wordcloud2()

# saveWidget(word_cloud_topic_2, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc8.png", delay=10)

輕軌主體分析

輕軌主題主要分成
- 輕軌經過大順路會塞車，希望高架或地下化
- 輕軌的營運、建設和工程情況 ### ϕ matrix (topic * term)

#根據每一篇文章的Url給定一個id。
tmp_find_word="輕軌"

reserved_word <- Url_words %>% 
  group_by(find_word,word) %>% 
  count() %>% 
  filter(n > 5) %>% ungroup %>% 
  filter(find_word==tmp_find_word) %>% .$word

Url_words_2 <- Url_words %>%
  mutate(artId = group_indices(., url)) %>% 
  filter(find_word==tmp_find_word,word%in%reserved_word)

Url_words_dtm <-Url_words_2 %>% cast_dtm(artId, word, n)
Url_words_lda <- LDA(Url_words_dtm, k =2, control = list(seed = 2000))


#從中可以得到特定主題生成特定詞彙的概率。
Url_words_topics <- tidy(Url_words_lda, matrix = "beta") # 注意，在tidy function裡面要使用"beta"來取出Phi矩陣。

Url_words_top_terms <- Url_words_topics %>%
  filter(! term %in% stop_words) %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


Url_words_top_terms %>% ungroup %>% 
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

θ matrix (document * topic)

#θ matrix (document * topic)
Url_words_documents <- tidy(Url_words_lda, matrix="gamma") # 在tidy function中使用參數"gamma"來取得 theta矩陣。
Url_words_documents

## # A tibble: 1,116 x 3
##    document topic   gamma
##    <chr>    <int>   <dbl>
##  1 614          1 1.000  
##  2 1078         1 0.789  
##  3 350          1 0.372  
##  4 915          1 0.159  
##  5 381          1 0.0387 
##  6 1149         1 0.725  
##  7 977          1 0.832  
##  8 1175         1 0.999  
##  9 80           1 0.00846
## 10 446          1 0.00170
## # ... with 1,106 more rows

#gamma值代表的是這篇文章中有多少比例的詞是出自於特定topic

簡單的主題分類

Url_words_2$artId <- as.character(Url_words_2$artId)

Url_words_documents%>%
  group_by(topic) %>%
  top_n(10, wt=gamma) %>%
  inner_join(Url_words_2, by = c("document" = "artId")) %>%
  distinct(url) %>%
  inner_join(noncommnet_A, by =c("url"="url")) %>%
  select(topic, title) %>% DT::datatable()

按照θ矩陣的“gamma”值劃分兩種主題的資料

Url_words_documents_spread <- Url_words_documents %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, gamma)

half_num = round(nrow(Url_words_documents_spread)/2) # 原始文章數量的一半

topic1_id <- Url_words_documents_spread %>%   # 取出topic_1最高的half_num篇文章
  top_n(half_num, topic1) %>%
  select(document) %>%
  unlist()

topic2_id <- Url_words_documents_spread$document %>%
  setdiff(topic1_id)

Topic 1

#install.packages("wordcloud2")
require(wordcloud2)
word_cloud_topic_1 <- Url_words_2 %>%
  filter(artId %in% topic1_id) %>%
  group_by(word) %>%
  summarise(sum = sum(n)) %>%
  arrange(desc(sum)) %>%
  wordcloud2()

# saveWidget(word_cloud_topic_1, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc9.png", delay=10)

Topic 2

word_cloud_topic_2 <- Url_words_2 %>%
  filter(artId %in% topic2_id  ) %>%
  group_by(word) %>%
  summarise(sum = sum(n)) %>%
  arrange(desc(sum)) %>%
  wordcloud2()

# saveWidget(word_cloud_topic_2, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc10.png", delay=10)

結論

針對民眾在PTT討論版上，分成三個主題公車、輕軌、捷運進行討論分析。
1. 公車: 對於公車的討論度較低，但由這些文章可以發現一卡通優惠免費措施討論熱度佔一定比例。
2. 捷運: 議題分成兩主題。第一個為高雄民眾對於捷運黃線的期待與黃線地下化的討論，另外為高雄捷運紅線岡山路竹延伸線計劃和台鐵有許多佔點重疊。
3. 輕軌: 捷運局表示，輕軌大順路可用來替代並滿足尖峰停車需求。但對於施工期間的車流疏通，民眾對此感到堪憂。

社群媒體分析-期中報告(第八組)

D052040002 蔡承翰

M072040001 蔡伊婷

資料集合

目錄

動機與分析目的

資料讀取

字典讀取

情緒詞典介紹

資料前處理

發文數量

文字雲

公車刪除停用詞文字雲

捷運刪除停用詞文字雲

輕軌刪除停用詞文字雲

所有運具前 20 大出現字詞

公車(只保留路線、站點和路名)

捷運(只保留路線、站點和路名)

輕軌(只保留路線、站點和路名)

所有運具前 20 大出現字詞(只保留路線、站點和路名)

Tf-Idf 查看重要的字

情緒分析

主文情緒分析

評論情緒分析

字詞 correlation 計算

使用網路圖呈現字詞相關性

文章-主題分析

捷運主體分析

ϕ matrix (topic * term)

θ matrix (document * topic)

簡單的主題分類

輕軌主體分析

θ matrix (document * topic)

簡單的主題分類

結論