資料集合

目錄

require(data.table)
require(dplyr)
require(readr)
require(dplyr)
require(stringr)
require(jiebaR)
require(tidytext)
require(NLP)
require(tidyr)
require(ggplot2)
require(ggraph)
require(igraph)
require(scales)
require(reshape2)
require(widyr)
require(knitr)
library(kableExtra)
library(data.table)
require(ropencc)
require(wordcloud2)
require(lubridate)
require(htmlwidgets)
require(webshot)
require(plotly)
require( RColorBrewer)

require(servr)
require(tm)
require(data.table)
require(stringr)
library(topicmodels)
require(LDAvis)
require(webshot)
## [1] "Chinese (Traditional)_Taiwan.950"

動機與分析目的

資料讀取

Sys.setlocale("LC_CTYPE", "cht")
## [1] "Chinese (Traditional)_Taiwan.950"
commnet_A=fread("comment_Kaohsiung.csv",encoding = "UTF-8") %>% filter(find_word!="bike");
noncommnet_A=fread("non_comment_Kaohsiung.csv",encoding = "UTF-8")%>% filter(find_word!="bike");

kable(noncommnet_A %>% head()%>% select(-ip,-imp_word_count,-route_imp_word_count)) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "500px")
url find_word text author title time board_content board_content_type comment_num
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html 公車

from 一卡通 ipass粉絲團

一卡通免費搭市區公車延長到6月底

高雄板非心情板、個板,無關高雄廣泛市民之文章請勿張貼。 這裡是高雄板,請討論與多數高雄市民有直接關聯性的文章。

修改文章標題請於文章列表按大寫T 同性質文章請以大E修改內文方式加註,請勿於板上重覆多發

◆ From:
scottlu28 (D) [閒聊] 一卡通免費搭公車延長到9月底 2014-02-25T05:47:15Z Kaohsiung 閒聊 34
https://www.ptt.cc/bbs/Kaohsiung/M.1393928042.A.3B5.html 公車

網誌好讀版:

前鎮分館

前言:

個月推出一卡通搭市區公車免費。本以為這活動與我無關,後來看到有人認真分享 搭免費公車通勤的心得,才開始思考能用這資源做些什麼。

到那幾條文化公車,偏偏文化公車不在免費範圍,要錢哪!想了幾天,思路不知怎 麼接上的,突然冒出了「圖書館」。

公車應該都到得了。再者自幾年前三民分館、寶珠分館空間改造變得很好用之後, 對其他分館也有了些好奇心,現在有免費公車,似乎是個好機會?去看看似乎無妨?

逛上一次,來個「搭免費公車逛圖書館」!動機和可行性皆備,時間就訂在日檢後 出發!

Day 1 102年12月3日 (二)晴 行程:前鎮分館→小港分館→林園二館→林園分館

第一站:前鎮分館

交通:────>〔一心一路口〕───>〔公正路口〕───>前鎮分館

得先走上一小段才能轉乘,於是眼睜睜地看著對向、也就是我要搭的83公車離開! 用語音查詢得知下一班要再等15分鐘,又想到下一站[瑞祥高中]有公共腳踏車,便

  抵達瑞祥高中找到公共腳踏車,離83公車到來還有一小段餘裕,想想還是搭公 車吧!便往最近的站牌走去。豈知,就在走到剩沒幾公尺、都已經能看見站牌了,

  所以實際上是:     168東          走         公共腳踏車    ────>〔一心一路口〕───>〔瑞祥高中〕───────>前鎮分館

雜記:   這是整個計劃的第一站!圖書館旅行就從這裡開始!

樓梯往上走到閱覽室。走進圖書館,映入眼睛的是一艘獨木舟,頓時還以為跑到那 瑪夏或桃源分館去了!明明剛剛是騎公共腳踏車來的呀!

物和館藏特色「原住民資料」,有很多沒想過的原住民書籍,有探討研究的、也有 輕鬆入門的,比如原住民神話繪本,整整列了兩排書櫃。

能看到,很容易搜尋。同為館藏特色的投資理財則倚著牆,不過因為這類書在各分 館的量都不少,就不另外借書,只從「原住民資料」區借了一本「巴斯達隘傳說」。

照片不多。接著在大樓外頭請路人幫忙拍了一張到此一遊,不小心聊太久,還腳踏 車扣了10元。第一站就花到錢了!說好的不花錢呢……?

開放時間:週二至週六 09:00-21:00;週日 09:00-17:00 總坪數:599坪 座位數:450席位 服務電話:07-7173008、07-7173009 [email protected] 傳真電話:07-7174440

– ◆ From:
謝謝各位。目前輕旅行已經告一段落,會慢慢把遊記整理出來。
wildphoenix (小不點) [遊記] 搭免費公車遊圖書館:前言&前鎮分館 2014-03-04T10:13:59Z Kaohsiung 遊記 4
https://www.ptt.cc/bbs/Kaohsiung/M.1394096518.A.EA9.html 公車

網誌好讀版

Day 1 102年12月3日 (二)晴

    公共腳踏車         69        公共腳踏車

  回到瑞祥高中還公共腳踏車,一台預定搭的紅12從旁邊跑掉,索性直接騎到捷 運凱旋站再還。還車時耍笨,沒注意到袋子的提把勾到腳踏車就上架,只好再借再

借還車並順利衝到前方幾公尺的站牌呢?   就在完成還車程序時,綠燈也亮起,不加思索拔腿狂奔,眼看公車啟動、即將

在我前方停下打開車門!原來是好心的公車司機要讓我上車!太感謝了。   往小港的路上人不多,廟很多,有一小段路就有三間廟,而且規模不小。到小

,等走回餐旅學院已經是半小時後了……

  小港分館實在太有特色!不只館內東逛西逛,館外也好好地繞上一圈,明明從 館外繞不覺得大,進到裡頭又不覺得小。館舍左方有停車場和一小塊庭園,外觀有

  進到裡頭徵求拍照許可才知道原來小港分館是得過獎的綠建築,常有人來此拍 照,難怪申請表格早早就備好,填完表格也有攝影證可佩戴。興奮之下到處亂拍,

  從樓梯走上二樓即可看到牆上掛著天文資料,看來是館藏特色「天文」的一部 份;同一樓另外有個由企業認養的空間,推廣環保和介紹鳳山丘陵及生態,第一次

  館內空間充足,不過藏書還不多,書架沒有擺滿,空書架也不少,要再過幾年 才會比較有書的味道吧!可是有很棒的採光(雖然照片看不出來 Orz),每樓都有

感覺心都靜了下來。   整棟下來逛下來似乎沒看見電梯?正要詢問館員時瞥見了。原來在角落,向外

有待加強啊!   小港分館實在太有特色!一不小心就多待了,導致還腳踏車時又被扣了30元,

開放時間:週二至週六09:00-21:00;週 日09:00-17:00(99/11/12日開館) 休館日期:週一及國定假日休館 座位數:閱覽席次218位 館藏量:5萬多冊 傳真電話:801-2226 地  址:小港區博學路365號 –
wildphoenix (小不點) [遊記] 搭免費公車遊圖書館:小港分館 2014-03-06T09:01:55Z Kaohsiung 遊記 2
https://www.ptt.cc/bbs/Kaohsiung/M.1394442060.A.13D.html 公車

網頁好讀版

Day 1 102年12月3日 (二)晴

   ───>〔林園區公所〕──>林園分館

,所以沒撲空。

  這站有友人加入,但因為我的車較晚到,結果還是各逛各的……   詢問拍照許可時不小心引出主任,說明原因後介紹了一本「小地方大書房:98

師在改造空間時的想法。有機會要去找來看看寶珠分館和陽明分館的介紹。   同樣地,林園分館分別在入口和廁所前方展示了書籍整理和館舍改造的照片故

  像是友善的行動動線(入口坡道、電梯、樓梯扶手、粗糙不滑的梯面、反光條 )、二樓的社區文化展示,以及大大的兒童閱覽區。聽說地下室還有?可惜不開放

  新書展示區和開卷主題合在一起,整個陳列區很大,一進門一定會被吸引過來 佇足,找找有無喜歡的書。我也在這找到「天生就會跑」,現在還不是借的時候,

  二樓書庫分左右兩間,都不大,但wifi訊號很強很好用。三樓是自修室和館藏 特色「紅樹林」區,不過一般借書只會爬到二樓吧!會到三樓多半是去自修室,感

不定。

  逛完林園分館才和友人會合,解決晚餐,並在18:44搭末班橘9公車回家,一 邊可惜來不及到大寮分館,順便發現橘9的路線可到4間分館,好像可以來研究一下

行程幾乎可以決定了!回程經過陽明分館,可惜時間太晚、已過開放時間,只好留 到下次。

檢討:   第一天行程的事先規劃做得不夠,只查了搭哪些車、沒查時刻表便匆匆上路,

真的如預定再去大寮分館,肯定會困在大寮、沒公車回家。   也因為太隨性規劃不足,交通上多花了不少時間,一站一站累積下來,回到家

  

– 林園分館

休館日期:週一、國定假日 總坪數:1182.31平方公尺 館藏量:約4萬5千多冊 服務電話:07-6431419 [email protected] 地  址:高雄市林園區林園北路236號

◆ From:
wildphoenix (小不點) [遊記] 搭免費公車遊圖書館:林園分館 2014-03-10T09:00:57Z Kaohsiung 遊記 2
https://www.ptt.cc/bbs/Kaohsiung/M.1394615924.A.0EE.html 公車

網頁好讀版

Day 2 102年12月6日

    紅33B         橘7B          走     7:18         7:51    約9:08到

  搭橘7B公車很好玩,乘客和司機似乎彼此是熟面孔,可以聽到他們話家常, 聊前一天末班車載幾個人啦、某個學校的老師也搭末班車或大樹祈福線啦、某校

  於是我大膽向司機請教到大樹果菜市場是否要在溪埔派出所站下車,結果司 機和乘客都表示不知道,只知道姑山倉庫、姑婆寮,都不是我要去的。只好繼續

讓我在大樹區公所問清楚再繼續往前開!怎麼這麼親切好心啊!   順帶一提,中途在只有去程才會開過去的和山寺有短暫的休息時間,聽到司

雜記:

有心理準備待上一整天直到閉館;如果不想整天被困在同一處,就要有靠雙腳走 五公里進入橘7A路線的覺悟。因為二個覺悟都沒有,曾經一度考慮放棄大樹三館

  根據站站時刻表,公車從溪埔派出所到終點佛光山,再從那邊回程到溪埔派 出所有短短的約14分鐘的空檔,辦得到!抓緊時間下車後快速問路、衝到圖書館

  只是時間真的很趕,來不及借書拍到此一遊照,館內也未能多加留意,可惜 ,只好留待下次了。不過要是這空檔無法增加,恐怕再來幾次也還是無法靜下心

  到的時候其實滿早的,才剛開館,館員阿姨正在夾報紙, 似乎有被我嚇到, 如果不是那麼趕,有記得跟她道早安就好了。

大樹三館 開放時間:週三至週日09:00-17:00 休館日期:週一、二、國定假日 座位數:42 位 館藏量:約9千多冊 傳真電話:6564272 地  址:84049大樹區溪埔里溪埔路3巷60-1號2樓 –
wildphoenix (小不點) [遊記] 搭免費公車遊圖書館:大樹三館 2014-03-12T09:18:42Z Kaohsiung 遊記 4
https://www.ptt.cc/bbs/Kaohsiung/M.1394803517.A.1A9.html 公車

網頁好讀版

Day 2 102年12月6日(五)晴

鳳山曹公分館

    橘7B         走     9:22

究站牌,試圖找出公路客運的替代方式,不過……全部都沒車是怎麼一回事?所以 到大樹分館也比預定的晚,原先預估的停留時間不夠用。

樹三館和果菜市場的位置,聽司機說果菜市場那邊平常經過都關著,只有鳳荔節左 右才開,難怪司機就不知道那邊便是我要去的果菜市場和圖書館,大概只有在地人

  這位司機真的很熱心!去程問路雖然不知道,也沒把我丟著不管,回程又載到 我也關心了一下。而且還記得我一開始是從長庚上車的,雖然據說是會搭這路的乘

雜記:

聊了一下,原來剛開始實施也被抱怨過,不過久了習慣後,發現脫鞋入館可以隨時 席地閱讀也不怕髒,不只是兒童閱覽區,是整館都能以地為席。若是假日就能親眼

  聽說大樹分館原本在山上,沒什麼人願意特地上山借書,搬到現址後方便得多 ,讀者也多起來了。二樓自修室也很特別,有一方陽台可提供給不喜歡冷氣的人使

我也心動想在這個陽台讀一次書回味回味。當然,當年的考試壓力就免了!

大樹分館 開放時間:週二至週六09:00-21:00週日9:00-17:00 休館日期:週一、國定假日 座位數:60位 館藏量:約2萬多冊 傳真電話:07-6517634 地  址:84047大樹區中興西路50號 – 還好橘7在附近有站,還是到得了。 耍笨了,謝謝提醒。 免費公車還有三個多月,可以先從高雄的大學開始。
wildphoenix (小不點) [遊記] 搭免費公車遊圖書館:大樹分館 2014-03-14T13:25:14Z Kaohsiung 遊記 4
kable(commnet_A %>% head() %>% select(-ip,-author,-time,-comment_ip)) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "500px")
url find_word commet_num text comment_type title board_content board_content_type comment_author comment_time
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html 公車 69025 這樣不會狂虧損嗎? 雖然開心免費這件事 [閒聊] 一卡通免費搭公車延長到9月底 Kaohsiung 閒聊 Alex1103 2014-02-25T08:03:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html 公車 69026 到底是要免費多久啊=口= [閒聊] 一卡通免費搭公車延長到9月底 Kaohsiung 閒聊 blankhole 2014-02-25T07:13:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html 公車 69027 推阿~~希望公車能再進步些 [閒聊] 一卡通免費搭公車延長到9月底 Kaohsiung 閒聊 cy2013 2014-02-25T12:19:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html 公車 69028 讚! [閒聊] 一卡通免費搭公車延長到9月底 Kaohsiung 閒聊 daisukidayo 2014-02-26T10:59:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html 公車 69029 最近看到越來越多人拿一卡通道OK小額消費了好現象 [閒聊] 一卡通免費搭公車延長到9月底 Kaohsiung 閒聊 dreamcoast 2014-02-25T15:47:00Z
https://www.ptt.cc/bbs/Kaohsiung/M.1393307238.A.EF8.html 公車 69030 大推! 繼續搭公車去旅行~ [閒聊] 一卡通免費搭公車延長到9月底 Kaohsiung 閒聊 familia 2014-03-01T01:55:00Z

字典讀取

stop_words_org <- readLines("./dict/stop_words.txt", 
                   encoding='utf-8')
## Warning in readLines("./dict/stop_words.txt", encoding = "utf-8"): 於 './
## dict/stop_words.txt' 找到不完整的最後一列
Encoding(stop_words_org)<-"UTF-8"
new_stop=c("板規","公車","捷運","輕軌","高雄市","高雄","更新","知道","謝謝","刪除","不過","目前","如果",
           "指出","文章","發現","請問","真的","請於","這種","內文","大寫","這裡","列表","標題","只有","其實",
           "有人","修改文章","xd")
stop_words=c(stop_words_org,new_stop)


station_name=readLines("./Station_data/station_name.txt")
route_name_org=readLines("./Station_data/route_name.txt")
route_name=ifelse(!str_detect(route_name_org,"[^0-9]"),paste0(route_name_org,"路"),route_name_org)

mrt_all_name=readLines("./Station_data/mrt_station_name.txt")
mrt_all_code=readLines("./Station_data/mrt_station_code.txt")

place_name=readLines("./Station_data/place_name.txt")
road_name=readLines("./Station_data/road_name.txt")

station_route_road=c(station_name,route_name,mrt_all_name,mrt_all_code,place_name,road_name)
#station_route_road=c(station_name,route_name,mrt_all_name,mrt_all_code,place_name,road_name)

my_word=readLines("./Station_data/my_dict.txt")


T1=rbind(cbind(station_name,"站點名稱"),
      cbind(route_name,"路線名稱"),
      cbind(c(mrt_all_name,mrt_all_code),"捷運相關"),
      cbind(road_name,"高雄路名稱")) %>% as.data.frame()
colnames(T1)=c("word","type")

T1=T1 %>% mutate(word=iconv(word,"big5","UTF-8"),type=iconv(type,"big5","UTF-8"))
T1 %>% DT::datatable()
# %>%
#   kable_styling(bootstrap_options = "striped", full_width = F) %>%
#   scroll_box(height = "500px",width="300px")

情緒詞典介紹

my_sentiment=readLines("./dict/my_sentiment_word.txt")

#情緒字眼
CVAW3_org=read.csv("./dict/CVAW3.csv",header=T,stringsAsFactors = FALSE)
CVAW3_org=CVAW3_org %>% mutate(Word=iconv(Word,"big5","UTF-8"))

plot_ly(data=CVAW3_org,x=~Valence_Mean,y=~Arousal_Mean,text = ~Word,
        marker = list(size = 7,
                       color = 'rgba(182, 192, 255, .9)',
                       line = list(color = 'rgba(0, 0, 152, .8)',
                                   width = 2))) %>%
  layout(
    title = "CVAW3",
    xaxis = list(title = "情緒值"),
    yaxis = list(title = "亢奮值")
    )
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
CVAW3=CVAW3_org %>% mutate(Valence_Mean=Valence_Mean-5,Arousal_Mean=Arousal_Mean-5) %>% 
  filter(Word%in%my_sentiment)



sentiment_delete_word=c("優惠","免費","才能","凱旋","很多")

CVAW3=CVAW3 %>% filter(!(Word%in%sentiment_delete_word)) 
sentiment_word=CVAW3$Word

all_word=c(station_name,route_name,mrt_all_name,mrt_all_code,place_name,road_name,my_word,sentiment_word)

資料前處理

time_change_format<-function(text_tmp){
  text_tmp=gsub("[0-9]{1,2}[::][0-9]{2}","[時間]",text_tmp);
  text_tmp=gsub("[0-9]{1,2}(分鐘|分|秒)","[時段]",text_tmp);
  text_tmp=gsub("[0-9]{1,}[\\.]?[0-9]{1,}[ ]?(公里|公尺|km|m)","[距離]",text_tmp);
  #text_tmp=gsub("\\$[0-9]{1,}","錢alpha",text_tmp);
  text_tmp=tolower(text_tmp);
  text_tmp=gsub("[0-9\\.一二三四五六七八九億萬千百十k]{1,}(元|\\$)","[錢]",text_tmp);
  text_tmp=gsub("\\$[0-9\\.一二三四五六七八九億萬千百十k]{1,}","[錢]",text_tmp);
  text_tmp=gsub("\\-","",text_tmp);
  text_tmp=gsub(paste0("(",paste(route_name_org,collapse = "|"),")(公車|號)"),"\\1路",text_tmp);
  return(text_tmp)
}

commnet_A=commnet_A  %>%
  filter(str_detect(text, regex("[\u4e00-\u62ff\u6300-\u77ff\u7800-\u8cff\u8d00-\u9fff]"))) %>% 
  mutate(text=time_change_format(text))


noncommnet_A=noncommnet_A %>% mutate(text=gsub("\n|\\s+","",text)) %>% 
  mutate(text=time_change_format(text))

kable(noncommnet_A  %>% filter(str_detect(text,"\\[時段\\]")) %>% arrange(nchar(text)) %>% head()) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "500px")
url find_word text ip author title time board_content board_content_type comment_num imp_word_count route_imp_word_count
https://www.ptt.cc/bbs/Kaohsiung/M.1420994736.A.597.html 捷運 如果搭捷運至小港站步行到小港醫院會很遠嗎?查過辜狗地圖顯示步行[時段]左右請問這一段路真的[時段]能走到嘛?謝謝 1.173.156.29 Lucialover (來日方長) [問題] 捷運小港站至小港醫院走路會很遠嗎? 2015-01-11T16:45:33Z Kaohsiung 問題 17 9 0
https://www.ptt.cc/bbs/Kaohsiung/M.1411517239.A.D41.html 公車 剛剛要搭這班車,[時間]分的,結果在凹子底沒有等到車,看牌子是預計[時間]分會到在想說是我自己錯過嗎??可是我[時段]前就在等車了 te040928 (Mika) [問題] 301路公車 2014-09-24T00:07:16Z Kaohsiung 問題 6 5 0
https://www.ptt.cc/bbs/Kaohsiung/M.1491981832.A.BFC.html 公車 我自己比較常搭的是民族幹線90路和168東西線這三條件是每班次間距大概[時段]左右比較頻繁的路線因為接下來是雨季搭公車還是比較安全一點謝謝 180.217.236.186 HIKONA (《開始游牧民族》) [問題] 高雄班次比較多的公車 2017-04-12T07:23:49Z Kaohsiung 問題 26 10 4
https://www.ptt.cc/bbs/Kaohsiung/M.1491456320.A.0BE.html 捷運 最近要去小港喝喜酒,地點離小港捷運站開車約時段,想說坐捷運到小港站,再租機車or搭小黃過去!請問此2種方式可行? wagotow (^^) [問題] 捷運小港站附近有出租機車or計程車排班? 2017-04-06T05:25:17Z Kaohsiung 問題 3 2 0
https://www.ptt.cc/bbs/Kaohsiung/M.1506861752.A.C7D.html 捷運 捷運真的很慢臺鐵還比較快而開車又太久只要設下去左營南科間[時段]天天通勤不是問題(北竹都有人通勤了月票才7千)就看市府有沒有前瞻去推了這樣南高合作不是很好嗎 dolare (dolare~) Re: [問題] 台積電不來,路竹捷運怎麼辦?? 2017-10-01T12:42:30Z Kaohsiung 問題 30 4 0
https://www.ptt.cc/bbs/Kaohsiung/M.1543131674.A.9A4.html 捷運 幹嘛蓋?不是說高雄一堆蚊子館只會借廁所嗎?新市長肯定要拆掉輕軌、捷運改建藝文中心成為廁所亞洲新灣區所有建設停工全面拼經濟在高雄政治[時段]請到台北抗爭討論市長政見叫崩潰? 223.138.6.24 TWNchurchill (金城武) Re: [閒聊] 捷運輕軌還會蓋嗎 2018-11-25T07:41:12Z Kaohsiung 閒聊 88 10 0

發文數量

noncommnet_A1=noncommnet_A %>% mutate(time=gsub("T|Z","",time))%>% 
  mutate(time=parse_date_time(time,"%Y-%m-%d %H-%M-%S"))%>% 
  mutate(month=floor_date(time,"month") %>% as.Date()) %>% 
  mutate(bus_route_ind=str_detect(text,paste0("(",paste(c(route_name),collapse = "|"),")")))

noncommnet_A1  %>% group_by(month,find_word) %>% summarise(count=n()) %>% 
  group_by(find_word) %>% mutate(max_count=max(count)) %>% 
ggplot(aes(x= month,y=count,fill=find_word)) +
geom_col(show.legend = FALSE) +
  scale_x_date(labels = date_format("%Y/%m/%d")) +
  facet_wrap(~find_word, ncol = 1, scales = "free_y")+
  geom_text(aes(label=ifelse(count==max_count,as.character(month),"")),fontface="bold",size=4)

文字雲

公車刪除停用詞文字雲

因公車較無探討議題,PTT內文主要是評論司機、班次誤點、一卡通優惠。

jieba_tokenizer = worker()
new_user_word(jieba_tokenizer, c(all_word,"[時間]","[時段]","[距離]","[錢]"))
## [1] TRUE
chi_tokenizer <- function(t) {
  lapply(t, function(x) {
    if(nchar(x)>1){ # 新增的地方
      tokens <- segment(x, jieba_tokenizer)
      # 去掉字串長度爲1的詞彙
      tokens <- tokens[nchar(tokens)>1]
      return(tokens)
    }
  })
}

A_place <- noncommnet_A1 %>% select(month,find_word,text)  %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>%
  #filter(!str_detect(word, regex("[a-zA-Z]"))) %>%
  count(find_word,word, sort = TRUE)  %>% 
  filter(!(word %in% stop_words)) 

A_place_filter=A_place%>% filter(word%in%station_route_road)


A_place %>% filter(find_word=="公車") %>% group_by(word)  %>% summarise(count=sum(n))%>% 
  top_n(100,count)  %>% 
  filter(str_detect(word,"[^0-9]")) %>% 
  wordcloud2() -> tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc1.png", delay=10)

捷運刪除停用詞文字雲

  • 主要探討議題有兩個,一個為北捷與高捷運量比較。
  • 另一個是捷運黃線的高架化(地下化)探討。
A_place %>% filter(find_word=="捷運") %>% group_by(word)  %>% summarise(count=sum(n)) %>%
  top_n(100,count) %>% 
  filter(str_detect(word,"[^0-9]")) %>% 
  wordcloud2() ->tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc2.png", delay=10)

輕軌刪除停用詞文字雲

輕軌於大順路的未來營運效益及興建交通的衝擊。

A_place %>% filter(find_word=="輕軌") %>% group_by(word)  %>% summarise(count=sum(n))%>% 
  top_n(100,count) %>% 
  filter(str_detect(word,"[^0-9]")) %>% 
  wordcloud2() -> tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc3.png", delay=10)

所有運具前 20 大出現字詞

A_place %>% group_by(find_word,word)  %>% summarise(count=sum(n))%>%
  group_by(find_word) %>% 
  top_n(20,count) %>% 
  ungroup %>% 
  mutate(word = reorder(word, count)) %>% 
  ggplot(aes(word, count, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()

公車(只保留路線、站點和路名)

  • 100路、三多商圈較靠近市區,因此討論度較高。
  • 中華幹線、168東則是跨越多個行政區因此也為民眾討論度也頗高。
A_place_filter  %>% filter(find_word=="公車") %>% group_by(word)  %>% summarise(count=sum(n)) %>% 
  wordcloud2() ->tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc4.png", delay=10)

捷運(只保留路線、站點和路名)

  • 在捷運規畫中藍線與黃建交會於大順路口,因此大順路教會明顯。
  • 2018年11月左右,行政院長賴清德今天表示「要讓高雄捷運延伸至台南」使得台南次數變高。
  • 紅線為(岡山-小港) 、橘線為(大寮-西子灣)民眾描述捷運紅、橘線也可會使用岡山-小港、大寮-西子灣表示,因此此四個字使用頻率較高。
A_place_filter %>% filter(find_word=="捷運") %>% group_by(word)  %>% summarise(count=sum(n)) %>% 
  wordcloud2()->tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc5.png", delay=10)

輕軌(只保留路線、站點和路名)

  • 輕軌站名:哈瑪星、夢時代、凱旋、籬內仔
  • 輕軌經過大順路受到許多爭議。
A_place_filter%>% filter(find_word=="輕軌") %>% group_by(word)  %>% summarise(count=sum(n)) %>% 
  wordcloud2()->tmp
# saveWidget(tmp, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc6.png", delay=10)

所有運具前 20 大出現字詞(只保留路線、站點和路名)

A_place_filter %>% group_by(find_word,word)  %>% summarise(count=sum(n))%>%
  group_by(find_word) %>% 
  top_n(20,count) %>% 
  ungroup %>% 
  mutate(word = reorder(word, count)) %>% 
  ggplot(aes(word, count, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()

Tf-Idf 查看重要的字

Url_words <- noncommnet_A1 %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>%
  filter(!str_detect(word, regex("[a-zA-Z]"))) %>%
  count(find_word,url, word, sort = TRUE)

Url_words_tf_idf <- Url_words %>%
  bind_tf_idf(word, url, n) %>% 
  filter(!(word%in%stop_words))

# 選每篇文章,tf-idf最大的十個詞,
# 並查看每個詞被選中的次數
Url_words_tf_idf%>% 
  group_by(url,find_word) %>%
  top_n(10,wt=tf_idf) %>%
  arrange(desc(url)) %>%
  group_by(find_word) %>%
  count(word, sort=TRUE) %>%
  top_n(20)%>%
  ungroup %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()
## Selecting by n

情緒分析

讀取 negation word

negation_word_tmp=readLines("./dict/negation_word/negation_word.txt",encoding = "UTF-8")
negation_word_tmp2=strsplit(negation_word_tmp,split="\t") %>% lapply( function(x)x[1]) %>% unlist
negation_word_tmp2_ind=strsplit(negation_word_tmp,split="\t") %>% lapply( function(x)x[2]) %>% unlist
ccst = converter(S2T)
negation_word_tmp3=c(ccst[negation_word_tmp2[negation_word_tmp2_ind==0]][1:220],"不應該","不符")
negation_word_tmp3
##   [1] "非"       "別"       "不"       "沒"       "無"       "勿"      
##   [7] "別出"     "別打"     "別帶"     "別動"     "別發"     "別煩"    
##  [13] "別放"     "別管"     "別開"     "別看"     "別來"     "別理"    
##  [19] "別亂"     "別弄"     "別惹"     "別是"     "別說"     "別提"    
##  [25] "別學"     "並不"     "不必"     "不成"     "不出"     "不大"    
##  [31] "不當"     "不到"     "不得"     "不等"     "不屌"     "不定"    
##  [37] "不懂"     "不動"     "不對"     "不多"     "不放"     "不該"    
##  [43] "不敢"     "不幹"     "不高"     "不搞"     "不給"     "不夠"    
##  [49] "不管"     "不行"     "不好"     "不合"     "不和"     "不會"    
##  [55] "不及"     "不佳"     "不見"     "不講"     "不可"     "不肯"    
##  [61] "不快"     "不了"     "不理"     "不利"     "不良"     "不靈"    
##  [67] "不買"     "不滿"     "不明"     "不能"     "不清"     "不去"    
##  [73] "不全"     "不讓"     "不認"     "不容"     "不如"     "不勝"    
##  [79] "不是"     "不說"     "不算"     "不太"     "不通"     "不妥"    
##  [85] "不像"     "不信"     "不需"     "不許"     "不要"     "不應"    
##  [91] "不用"     "不願"     "不再"     "不在"     "不值"     "不止"    
##  [97] "不住"     "不準"     "不足"     "不做"     "從不"     "都不"    
## [103] "而不"     "還不"     "還沒"     "毫不"     "毫無"     "很不"    
## [109] "決不"     "絕不"     "沒錯"     "沒法"     "沒空"     "沒錢"    
## [115] "沒人"     "沒用"     "沒有"     "請別"     "無法"     "無關"    
## [121] "無效"     "無需"     "無用"     "永不"     "又不"     "又沒"    
## [127] "沒戲"     "阻止"     "制止"     "遠非"     "絕非"     "決非"    
## [133] "否認"     "否定"     "難以"     "短缺"     "缺場"     "缺數"    
## [139] "稀缺"     "缺少"     "缺貨"     "未能"     "未定"     "請勿"    
## [145] "勿食"     "很差"     "差勁"     "差評"     "不堪"     "不同"    
## [151] "不便"     "不知"     "不宜"     "不負"     "不顧"     "不想"    
## [157] "匱乏"     "缺乏"     "匿乏"     "饋乏"     "窘乏"     "並不是"  
## [163] "並沒有"   "不等於"   "不合理"   "不靠譜"   "不可能"   "不瞭解"  
## [169] "不確定"   "不需要"   "不要緊"   "不一樣"   "不用說"   "不爭氣"  
## [175] "不知道"   "達不到"   "得不到"   "過不去"   "看不到"   "看不見"  
## [181] "看不清"   "來不及"   "買不起"   "沒辦法"   "沒想到"   "沒有過"  
## [187] "沒有用"   "上不去"   "受不了"   "想不到"   "用不了"   "用不<U+7740>"
## [193] "有沒有"   "找不到"   "非必需"   "非行家"   "未生效"   "從未有"  
## [199] "勿觸摸"   "非常差"   "食慾差"   "用不上"   "不像話"   "不可以"  
## [205] "不足取"   "顧不得"   "不值得"   "碰不得"   "記不住"   "成不了"  
## [211] "不入流"   "不願意"   "容不得"   "不熟悉"   "不能夠"   "不可比"  
## [217] "瞧不上"   "抗不住"   "不怎麼樣" "寸步難行" "不應該"   "不符"

主文情緒分析

  • 利用「情緒字典」計算主文中文情緒
  • 給出頻率前 20 大的情緒詞

從下面的結果得出不同載具的情緒字眼有所不同

  • 公車
    • 方便: 高雄公車一直被討論方不方便
    • 問題: 高雄公車搭乘人數較少,大家會討論問題點
    • 麻煩: 高雄公車不夠便利,大家覺得搭乘很麻煩
  • 捷運 +建設、發展: 捷運需要花費大量金額建設和發展 +方便: 同高雄公車 +問題: 通高雄公車

  • 輕軌
    • 建設、問題: 同上
    • 改善、安全: 輕軌的安全度一直備受爭議
    • 完成: 輕軌最近才完成
######## 單一斷詞
A_url_org <- noncommnet_A1 %>% select(url,month,find_word,text)  %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>% 
  count(url,month,find_word,word, sort = TRUE)  %>% 
  inner_join(CVAW3,by=c("word"="Word")) 


A_url_org %>% group_by(find_word,word) %>% 
  summarise(n=sum(n),total_Valence_Mean=sum(n*Valence_Mean),total_Arousal_Mean=sum(n*Arousal_Mean))%>% 
  mutate(size_Valence_Mean=(total_Valence_Mean>=0),size_Arousal_Mean=(total_Arousal_Mean>=0)) %>% 
  group_by(find_word,size_Valence_Mean) %>% 
  top_n(10,wt = abs(total_Valence_Mean)) %>% 
  ungroup %>% 
  mutate(word = reorder(word, total_Valence_Mean)) %>% 
  ggplot(aes(word, total_Valence_Mean, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()

  • 利用「否定詞」修正 bigram 中文情緒
  • 對不同的「否定詞」給出頻率前 15 大的情緒詞
  • 目前給出出現頻率最大的兩個否定詞「不」和「沒」
# unnest_tokens 使用的bigram分詞函數
# Input: a character vector
# Output: a list of character vectors of the same length
jieba_bigram <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    bigram<- ngrams(tokens, 2)
    bigram <- lapply(bigram, paste, collapse = " ")
    unlist(bigram)
  })
}

A_url_org_bigram_dict <- noncommnet_A1 %>%
  unnest_tokens(bigram, text, token = jieba_bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ")%>%
  count(word1, word2, sort = TRUE) %>% 
  filter(word1%in%negation_word_tmp3 & word2%in%CVAW3$Word)

A_url_org_bigram <- noncommnet_A1 %>%
  unnest_tokens(bigram, text, token = jieba_bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ")%>%
  count(url,month,find_word,word1, word2, sort = TRUE) %>% 
  filter(word1%in%negation_word_tmp3 & word2%in%CVAW3$Word) %>% 
  left_join(CVAW3,by=c("word2"="Word"))

A_url_org_bigram2=A_url_org_bigram%>% 
  group_by(url) %>%
  summarise(negation_Valence_per_url=sum(n*Valence_Mean),negation_Arousal_per_url=sum(n*Arousal_Mean)) 


tmp1=A_url_org_bigram %>% group_by(find_word,word1) %>% summarise(word1_n=n()) %>% group_by(find_word) %>% 
  top_n(2,word1_n)

A_url_org_bigram %>% left_join(tmp1,by=c("find_word","word1")) %>% na.omit %>% 
  mutate(tmp_ind=paste(find_word,word1),total_Valence_Mean=-1*(n*Valence_Mean)) %>%
  group_by(tmp_ind) %>% 
  top_n(15,abs(total_Valence_Mean)) %>% 
  ungroup() %>% 
  mutate(word2 = reorder(word2, total_Valence_Mean)) %>% 
  ggplot(aes(word2, total_Valence_Mean, fill = tmp_ind)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~tmp_ind, nrow = 2, scales="free",dir="v") + 
  coord_flip()

  • 單一斷詞情緒與 bigram 修正情緒結合
  • 計算每一篇文章的情緒值和亢奮值
  • 從下圖可得出
    • 正面文章都會以較為理性的方式撰寫 ( 有可能誇太大容易引起反彈被攻擊 )
    • 反面文章都比較激動
A_url_org2_tmp=A_url_org%>% 
  group_by(url,month,find_word) %>%
  summarise(Valence_per_url=sum(n*Valence_Mean),Arousal_per_url=sum(n*Arousal_Mean)) 

A_url_org2=A_url_org2_tmp %>% left_join(A_url_org_bigram2,by="url") %>% 
  mutate(Valence_per_url=Valence_per_url-2*ifelse(!is.na(negation_Valence_per_url),negation_Valence_per_url,0),
         Arousal_per_url=Arousal_per_url-2*ifelse(!is.na(negation_Valence_per_url),negation_Arousal_per_url,0))


tmp_A= noncommnet_A %>% select(url,title);
tmp_A2=A_url_org2 %>% left_join(tmp_A,by="url") %>% 
  mutate(hyper_link=paste0("<a href='",url,"'>","Go","</a>"));
tmp_A3=tmp_A2 %>% ungroup %>% filter(Arousal_per_url>=quantile(Arousal_per_url,0.995)|Arousal_per_url<=quantile(Arousal_per_url,0.005)|
                                       Valence_per_url>=quantile(Valence_per_url,0.995)|Valence_per_url<=quantile(Valence_per_url,0.005))

plot_ly(data=tmp_A2,x=~Valence_per_url,y=~Arousal_per_url,text = ~title,type = 'scatter',
           mode = 'markers') %>%
  add_trace(data=tmp_A3,
    x=~(Valence_per_url-1.25),
    y=~(Arousal_per_url-1.25),
    text=~hyper_link,
    mode = 'text',
    showlegend = F
  ) %>%
  layout(
    title = "ptt 每一篇文章",
    xaxis = list(title = "情緒值(越大越正面)"),
    yaxis = list(title = "亢奮值(越大越亢奮)")
    )
  • 計算每月平均每篇文章的情緒值和亢奮值
  • 雖情緒有變動,但較為亢奮的文章表示當月內文平均較激動,更值得注意
  • 這裡較為激進的月份分別為
    • 公車: 2017-05、2018-12
    • 捷運: 2018-03、2018-12
    • 輕軌: 2018-04、2018-10
A_url=A_url_org2%>% 
  group_by(month,find_word) %>% 
  summarise(Valence_url_Mean=mean(Valence_per_url),
            Arousal_url_Mean=mean(Arousal_per_url)
            )


all_month=seq(from=min(A_url$month),to=max(A_url$month),by=30) %>% floor_date("month") %>% unique()
all_month_2=all_month[seq(1,length(all_month),4)]

A_url  %>% gather(line_type,Value,-month,-find_word) %>%
  group_by(line_type,find_word) %>% mutate(max_Value=max(Value),min_Value=min(Value)) %>%
  ungroup %>% 
  ggplot()+
  geom_line(aes(x=month,y=Value,group=line_type,colour=line_type),size=1.5)+
  scale_x_date(labels = date_format("%Y/%m"),breaks = all_month_2)+
  geom_text(aes(month,Value,label=ifelse(Value==max_Value & line_type=="Valence_url_Mean",as.character(month),"")),fontface="bold",size=4,color="blue")+
  geom_text(aes(month,Value,label=ifelse(Value==min_Value & line_type=="Valence_url_Mean",as.character(month),"")),fontface="bold",size=4,color="red")+
  facet_wrap(~find_word, ncol = 1, scales = "free_y")+
  theme(axis.text.x = element_text(face="bold", color="#993333",  angle=45))

評論情緒分析

  • 利用「情緒字典」計算評論中文情緒
  • 給出頻率前 20 大的情緒詞
  • 解釋同主文
commnet_A1=commnet_A %>% mutate(time=gsub("T|Z","",time))%>% 
  mutate(time=parse_date_time(time,"%Y-%m-%d %H-%M-%S"))%>% 
  mutate(month=floor_date(time,"month") %>% as.Date()) 

######## 單一斷詞
A_commet_url_org <- commnet_A1 %>% select(url,commet_num,month,find_word,text)  %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>% 
  count(url,commet_num,month,find_word,word, sort = TRUE)  %>% 
  inner_join(CVAW3,by=c("word"="Word")) 

A_commet_url_org %>% group_by(find_word,word) %>% 
  summarise(n=sum(n),total_Valence_Mean=sum(n*Valence_Mean),total_Arousal_Mean=sum(n*Arousal_Mean))%>% 
  mutate(size_Valence_Mean=(total_Valence_Mean>=0),size_Arousal_Mean=(total_Arousal_Mean>=0)) %>% 
  group_by(find_word,size_Valence_Mean) %>% 
  top_n(10,wt = abs(total_Valence_Mean)) %>% 
  ungroup %>% 
  mutate(word = reorder(word, total_Valence_Mean)) %>% 
  ggplot(aes(word, total_Valence_Mean, fill = find_word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~find_word, nrow = 1, scales="free") + 
  coord_flip()

利用「否定詞」修正 bigram 中文情緒

######## bigram
A_commet_url_org_bigram_dict <- commnet_A1 %>%
  unnest_tokens(bigram, text, token = jieba_bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ")%>%
  count(word1, word2, sort = TRUE) %>% 
  filter(word1%in%negation_word_tmp3 & word2%in%CVAW3$Word)

A_commet_url_org_bigram <- commnet_A1 %>%
  unnest_tokens(bigram, text, token = jieba_bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ")%>%
  count(url,commet_num,month,find_word,word1, word2, sort = TRUE) %>% 
  filter(word1%in%negation_word_tmp3 & word2%in%CVAW3$Word) %>% 
  left_join(CVAW3,by=c("word2"="Word"))

tmp2=A_commet_url_org_bigram %>% group_by(find_word,word1) %>% summarise(word1_n=n()) %>% group_by(find_word) %>% 
  top_n(2,word1_n)

A_commet_url_org_bigram %>% left_join(tmp2,by=c("find_word","word1")) %>% na.omit %>% 
  mutate(tmp_ind=paste(find_word,word1),total_Valence_Mean=-1*(n*Valence_Mean)) %>%
  group_by(tmp_ind) %>% 
  top_n(15,abs(total_Valence_Mean)) %>% 
  ungroup() %>% 
  mutate(word2 = reorder(word2, total_Valence_Mean)) %>% 
  ggplot(aes(word2, total_Valence_Mean, fill = tmp_ind)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~tmp_ind, nrow = 2, scales="free",dir="v") + 
  coord_flip()

  • 單一斷詞情緒與 bigram 修正情緒結合
  • 計算每一篇文章平均每一篇留言評論情緒值和亢奮值
A_commet_url_org_bigram2_percomment=A_commet_url_org_bigram%>% 
  group_by(url,commet_num) %>%
  summarise(negation_Valence_per_comment=sum(n*Valence_Mean),negation_Arousal_per_comment=sum(n*Arousal_Mean))


A_commet_url_org2_percomment=A_commet_url_org%>% 
  group_by(url,commet_num,month,find_word) %>%
  summarise(Valence_per_comment=sum(n*Valence_Mean),Arousal_per_comment=sum(n*Arousal_Mean)) %>% 
  left_join(A_commet_url_org_bigram2_percomment,by=c("url","commet_num")) %>% 
  mutate(Valence_per_comment=Valence_per_comment-2*ifelse(!is.na(negation_Valence_per_comment),negation_Valence_per_comment,0),
         Arousal_per_comment=Arousal_per_comment-2*ifelse(!is.na(negation_Arousal_per_comment),negation_Arousal_per_comment,0));

  
A_commet_url_org2=A_commet_url_org2_percomment %>% 
  group_by(url,month,find_word) %>%
  summarise(Mean_Comment_Valence_per_url=mean(Valence_per_comment),
            Mean_Comment_Arousal_per_url=mean(Arousal_per_comment),
            observation=n()) %>% filter(observation>=5)
  • 只顯示主文和評論亢奮值較高的文章
  • 可以觀察出文章內的內文都相對較激進
  • 從大部分的文章,可以得出這些文章大部分確實是較為激進的文章和評論
A_url_comment_combine=A_url_org2 %>% ungroup %>% 
  inner_join(A_commet_url_org2,by=c("url","find_word")) %>% select(url,
                                                    Valence_per_url,Arousal_per_url,
                                                    Mean_Comment_Valence_per_url,
                                                    Mean_Comment_Arousal_per_url) %>% 
  filter(Arousal_per_url>=quantile(Arousal_per_url,0.7),
         Mean_Comment_Arousal_per_url>=quantile(Mean_Comment_Arousal_per_url,0.7)) %>% 
  mutate(hyper_link=paste0("<a href='",url,"'>","Go","</a>"));
  
tmp_A= noncommnet_A %>% select(url,title);
tmp_A2=A_url_comment_combine %>% left_join(tmp_A,by="url") %>% 
  mutate(hyper_link=paste0("<a href='",url,"'>","Go","</a>"));

plot_ly(data=tmp_A2,x=~Mean_Comment_Valence_per_url,y=~Valence_per_url,text = ~title,type = 'scatter',mode = 'markers') %>%
  add_trace(
    x=~(Mean_Comment_Valence_per_url-0.05),
    y=~(Valence_per_url-1.25),
    text=~hyper_link,
    mode = 'text',
    showlegend = F
  ) %>%
  layout(
    title = "ptt 每一篇文章",
    xaxis = list(title = "評論情緒值(越大越正面)"),
    yaxis = list(title = "正文情緒值(越大越正面)")
    )
  • 計算每月平均每篇文章平均每篇評論的情緒值和亢奮值
  • 解釋與主文同里
A_commet_url_org2=A_commet_url_org2_percomment %>% 
  group_by(url,month,find_word) %>%
  summarise(Mean_Comment_Valence_per_url=mean(Valence_per_comment),
            Mean_Comment_Arousal_per_url=mean(Arousal_per_comment),
            observation=n()) %>% filter(observation>=5)

A_commet_url=A_commet_url_org2%>% 
  group_by(month,find_word) %>% 
  summarise(Valence_url_comment_Mean=mean(Mean_Comment_Valence_per_url),
            Arousal_url_comment_Mean=mean(Mean_Comment_Arousal_per_url)
  )


all_month=seq(from=min(A_commet_url$month),to=max(A_commet_url$month),by=30) %>% floor_date("month") %>% unique()
all_month_2=all_month[seq(1,length(all_month),4)]


A_commet_url  %>% gather(line_type,Value,-month,-find_word) %>% 
  group_by(line_type,find_word) %>% mutate(max_Value=max(Value),min_Value=min(Value)) %>%
  ungroup %>% 
  #mutate(line_colour=ifelse(str_detect(line_type,"Valence"),"#CC6666", "#9999CC"))%>%
  ggplot()+
  geom_line(aes(x=month,y=Value,group=line_type,colour=line_type),size=1.5)+
  scale_x_date(labels = date_format("%Y/%m"),breaks = all_month_2)+
  geom_text(aes(month,Value,label=ifelse(Value==max_Value & line_type=="Valence_url_comment_Mean",as.character(month),"")),fontface="bold",size=4,color="blue")+
  geom_text(aes(month,Value,label=ifelse(Value==min_Value & line_type=="Valence_url_comment_Mean",as.character(month),"")),fontface="bold",size=4,color="red")+
  facet_wrap(~find_word, ncol = 1, scales = "free_y")+
  theme(axis.text.x = element_text(face="bold", color="#993333",  angle=45))

字詞 correlation 計算

Url_words <- noncommnet_A1 %>%
  unnest_tokens(word, text, token=chi_tokenizer) %>%
  count(url,find_word, word, sort = TRUE) %>% 
  filter(!(word%in%stop_words))%>% 
  filter(!str_detect(word,"^[0-9]{1,}\\.$"))


# 計算兩個詞彙間的相關性
word_cors=lapply(c("公車","捷運","輕軌"), function(tmp_find_word){
  word_cors_tmp <- Url_words %>% filter(find_word==tmp_find_word) %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  ungroup %>% 
  pairwise_cor(word, url, sort = TRUE)
  return(data.frame(find_word=tmp_find_word,word_cors_tmp))
}) %>% do.call(what="rbind")


word_cors %>% filter(find_word=="捷運") %>% 
  filter(item1 %in% c("黃線","運量","台鐵")) %>%
  group_by(item1) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~item1 , scales = "free") +
  coord_flip()
## Selecting by correlation

word_cors %>% filter(find_word=="輕軌") %>% 
  filter(item1 %in% c("大順路","工程","交通","[錢]")) %>%
  group_by(item1) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~item1 , scales = "free") +
  coord_flip()
## Selecting by correlation

使用網路圖呈現字詞相關性

  • 公車字詞網絡
  • 能看出討論公車免費或優惠的方案
seed_words=c("加註","板非","重覆多發","讀版","da","館藏","網頁","點定","要問","別忘了","記者")


threshold <- .65
remove_words <- word_cors %>% filter(find_word=="公車") %>%
                filter((item1 %in% seed_words|item2 %in% seed_words), correlation>threshold) %>%
                .$item1 %>%
                unique()
remove_words
##  [1] "點定"     "請參閱"   "無關"     "板非"     "重覆多發" "加註"    
##  [7] "關聯性"   "張貼"     "讀版"     "da"       "修改"     "性質"    
## [13] "館藏"     "要問"     "別忘了"   "附上"     "心情"     "網頁"    
## [19] "討論"     "高雄市民" "市民"     "報導"     "記者"     "開放"    
## [25] "方式"     "日期"
word_cors_new=word_cors %>% filter(find_word=="公車") %>% select(-find_word) %>% 
  filter((!(item1 %in% remove_words|item2 %in% remove_words))) 
word_cors_new %>%
  filter(correlation > .3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
  theme_void()
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

  • 捷運字詞網絡
  • 主要有討論黃線的建設、地下化、公車和成本等等
  • 討論黃線的文章,大部分也會討論紅線延伸路竹與台鐵路竹重疊率高
word_cors_new=word_cors %>% filter(find_word=="捷運") %>% select(-find_word) %>% 
  filter((!(item1 %in% remove_words|item2 %in% remove_words))) 
word_cors_new %>%
  filter(correlation > .3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
  theme_void()
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

  • 輕軌字詞網絡
  • 輕軌主要討論建設和一些輕軌站點
word_cors_new=word_cors %>% filter(find_word=="輕軌") %>% select(-find_word) %>% 
  filter((!(item1 %in% remove_words|item2 %in% remove_words))) 
word_cors_new %>%
  filter(correlation > .3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
  theme_void()
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

文章-主題分析

捷運主體分析

  • 因為公車本身主題不明顯,故主題分析只討論捷運和輕軌
  • 捷運主體主要分成
    • 高捷和北捷運量上的差異
    • 黃線的規劃討論

ϕ matrix (topic * term)

#根據每一篇文章的Url給定一個id。
tmp_find_word="捷運"

reserved_word <- Url_words %>% 
  group_by(find_word,word) %>% 
  count() %>% 
  filter(n > 5) %>% ungroup %>% 
  filter(find_word==tmp_find_word) %>% .$word

Url_words_2 <- Url_words %>%
  mutate(artId = group_indices(., url)) %>% 
  filter(find_word==tmp_find_word,word%in%reserved_word)

Url_words_dtm <-Url_words_2 %>% cast_dtm(artId, word, n)
Url_words_lda <- LDA(Url_words_dtm, k =2, control = list(seed = 2000))


#從中可以得到特定主題生成特定詞彙的概率。
Url_words_topics <- tidy(Url_words_lda, matrix = "beta") # 注意,在tidy function裡面要使用"beta"來取出Phi矩陣。

Url_words_top_terms <- Url_words_topics %>%
  filter(! term %in% stop_words) %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


Url_words_top_terms %>% ungroup %>% 
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

θ matrix (document * topic)

#θ matrix (document * topic)
Url_words_documents <- tidy(Url_words_lda, matrix="gamma") # 在tidy function中使用參數"gamma"來取得 theta矩陣。
Url_words_documents
## # A tibble: 1,122 x 3
##    document topic    gamma
##    <chr>    <int>    <dbl>
##  1 592          1 0.999   
##  2 727          1 0.000509
##  3 759          1 0.111   
##  4 619          1 0.00170 
##  5 1368         1 0.000587
##  6 290          1 0.481   
##  7 886          1 0.00135 
##  8 260          1 0.999   
##  9 693          1 0.0857  
## 10 1149         1 0.117   
## # ... with 1,112 more rows
#gamma值代表的是這篇文章中有多少比例的詞是出自於特定topic

簡單的主題分類

  • 從標題可以簡單看出確實和我們猜測的主題相符
Url_words_2$artId <- as.character(Url_words_2$artId)

Url_words_documents%>%
  group_by(topic) %>%
  top_n(10, wt=gamma) %>%
  inner_join(Url_words_2, by = c("document" = "artId")) %>%
  distinct(url) %>%
  inner_join(noncommnet_A, by =c("url"="url")) %>%
  select(topic, title) %>% DT::datatable()
  • 按照θ矩陣的“gamma”值劃分兩種主題的資料
Url_words_documents_spread <- Url_words_documents %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, gamma)

half_num = round(nrow(Url_words_documents_spread)/2) # 原始文章數量的一半

topic1_id <- Url_words_documents_spread %>%   # 取出topic_1最高的half_num篇文章
  top_n(half_num, topic1) %>%
  select(document) %>%
  unlist()

topic2_id <- Url_words_documents_spread$document %>%
  setdiff(topic1_id)
  • Topic 1
#install.packages("wordcloud2")
require(wordcloud2)
word_cloud_topic_1 <- Url_words_2 %>%
  filter(artId %in% topic1_id) %>%
  group_by(word) %>%
  summarise(sum = sum(n)) %>%
  arrange(desc(sum)) %>%
  wordcloud2()

# saveWidget(word_cloud_topic_1, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc7.png", delay=10)
  • Topic 2
word_cloud_topic_2 <- Url_words_2 %>%
  filter(artId %in% topic2_id  ) %>%
  group_by(word) %>%
  summarise(sum = sum(n)) %>%
  arrange(desc(sum)) %>%
  wordcloud2()

# saveWidget(word_cloud_topic_2, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc8.png", delay=10)

輕軌主體分析

  • 輕軌主題主要分成
    • 輕軌經過大順路會塞車,希望高架或地下化
    • 輕軌的營運、建設和工程情況 ### ϕ matrix (topic * term)
#根據每一篇文章的Url給定一個id。
tmp_find_word="輕軌"

reserved_word <- Url_words %>% 
  group_by(find_word,word) %>% 
  count() %>% 
  filter(n > 5) %>% ungroup %>% 
  filter(find_word==tmp_find_word) %>% .$word

Url_words_2 <- Url_words %>%
  mutate(artId = group_indices(., url)) %>% 
  filter(find_word==tmp_find_word,word%in%reserved_word)

Url_words_dtm <-Url_words_2 %>% cast_dtm(artId, word, n)
Url_words_lda <- LDA(Url_words_dtm, k =2, control = list(seed = 2000))


#從中可以得到特定主題生成特定詞彙的概率。
Url_words_topics <- tidy(Url_words_lda, matrix = "beta") # 注意,在tidy function裡面要使用"beta"來取出Phi矩陣。

Url_words_top_terms <- Url_words_topics %>%
  filter(! term %in% stop_words) %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


Url_words_top_terms %>% ungroup %>% 
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

θ matrix (document * topic)

#θ matrix (document * topic)
Url_words_documents <- tidy(Url_words_lda, matrix="gamma") # 在tidy function中使用參數"gamma"來取得 theta矩陣。
Url_words_documents
## # A tibble: 1,116 x 3
##    document topic   gamma
##    <chr>    <int>   <dbl>
##  1 614          1 1.000  
##  2 1078         1 0.789  
##  3 350          1 0.372  
##  4 915          1 0.159  
##  5 381          1 0.0387 
##  6 1149         1 0.725  
##  7 977          1 0.832  
##  8 1175         1 0.999  
##  9 80           1 0.00846
## 10 446          1 0.00170
## # ... with 1,106 more rows
#gamma值代表的是這篇文章中有多少比例的詞是出自於特定topic

簡單的主題分類

Url_words_2$artId <- as.character(Url_words_2$artId)

Url_words_documents%>%
  group_by(topic) %>%
  top_n(10, wt=gamma) %>%
  inner_join(Url_words_2, by = c("document" = "artId")) %>%
  distinct(url) %>%
  inner_join(noncommnet_A, by =c("url"="url")) %>%
  select(topic, title) %>% DT::datatable()
  • 按照θ矩陣的“gamma”值劃分兩種主題的資料
Url_words_documents_spread <- Url_words_documents %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, gamma)

half_num = round(nrow(Url_words_documents_spread)/2) # 原始文章數量的一半

topic1_id <- Url_words_documents_spread %>%   # 取出topic_1最高的half_num篇文章
  top_n(half_num, topic1) %>%
  select(document) %>%
  unlist()

topic2_id <- Url_words_documents_spread$document %>%
  setdiff(topic1_id)
  • Topic 1
#install.packages("wordcloud2")
require(wordcloud2)
word_cloud_topic_1 <- Url_words_2 %>%
  filter(artId %in% topic1_id) %>%
  group_by(word) %>%
  summarise(sum = sum(n)) %>%
  arrange(desc(sum)) %>%
  wordcloud2()

# saveWidget(word_cloud_topic_1, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc9.png", delay=10)
  • Topic 2
word_cloud_topic_2 <- Url_words_2 %>%
  filter(artId %in% topic2_id  ) %>%
  group_by(word) %>%
  summarise(sum = sum(n)) %>%
  arrange(desc(sum)) %>%
  wordcloud2()

# saveWidget(word_cloud_topic_2, "word_cloud_topic.html", selfcontained = F)
# img <- webshot("word_cloud_topic.html", "./graph/wc10.png", delay=10)

結論