2020年因強烈的太平洋副熱帶氣壓,台灣創下56年來首度沒有颱風登入的紀錄。這也導致水庫蓄水量逐漸不足,全台面臨旱災缺水的危機。在水利署5月25日發出的水情燈號中,已經有許多縣市進入減量供水的燈色燈號,導致各地區進入不同程度的減壓供水、限水、停耕、歇業等情況,這是在1947年以來最嚴重的一次乾旱,但是這波旱象在今年的5月底至6月初幾波梅雨鋒面帶來明顯的降雨而初步緩解。
我們將3、4月與5、6月進行切割,探討在降雨的前後PTT版上相關討論的發文風向,主要針對以下方向進行分析:
Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # 避免中文亂碼## [1] ""
packages = c("readr", "dplyr", "jiebaR", 'scales', 'tm', 'purrr', "tidyr", "tidytext", "igraph", "topicmodels", "ggplot2", "stringr","networkD3")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)library(readr)
library(dplyr)
library(jiebaR)
library(tidyr)
library(tidytext)
library(igraph)
library(topicmodels)
library(stringr)
library(ggplot2)
library(reshape2)
library(scales)
library(tm)
library(purrr)
library(wordcloud2)post <- read.csv("PTT下雨_articleMetaData.csv")
review <- read.csv("PTT下雨_articleReviews.csv")# 發文數量的變化
post %>%
mutate(artDate = as.Date(artDate)) %>%
group_by(artDate) %>%
summarise(count = n())%>%
ggplot(aes(artDate,count))+
geom_line(color="red")+
geom_point()兩大討論高峰大約是在四月中旬的時候與五月中旬的時候,前者屬乾旱時期的討論,後者屬降雨過後的討論。那在五月的下旬這一段討論的聲量也算是偏高,推測可能與梅雨鋒面有關。
# 處理一下sentence的特殊符號,以及常出現的格式用字
post <- post %>%
mutate(sentence=gsub("[\n]{2,}", "。", sentence)) %>%
mutate(sentence=gsub("\n", "", sentence)) %>%
mutate(sentence=gsub("http(s)?[-:\\/A-Za-z0-9\\.]+", " ", sentence))%>%
mutate(sentence=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除", "", sentence))
review <- review %>%
mutate(cmtContent=gsub("[\n]{2,}", "。", cmtContent)) %>%
mutate(cmtContent=gsub("\n", "", cmtContent)) %>%
mutate(cmtContent=gsub("http(s)?[-:\\/A-Za-z0-9\\.]+", " ", cmtContent))%>%
mutate(cmtContent=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除", "", cmtContent))
# 挑選文章對應的留言
review = left_join(post, review[,c("artUrl", "cmtContent")], by = "artUrl")# 加入自定義的字典
jieba_tokenizer <- worker(user="dict/user_dict.txt", stop_word = "dict/stop_words.txt")
# 設定斷詞function
customized_tokenizer <- function(t) {
lapply(t, function(x) {
tokens <- segment(x, jieba_tokenizer)
return(tokens)
})
}
# 把文章和留言的斷詞結果併在一起
MToken <- post %>% unnest_tokens(word, sentence, token=customized_tokenizer)
RToken <- review %>% unnest_tokens(word, cmtContent, token=customized_tokenizer)
# 把資料併在一起
data <- rbind(MToken[,c("artDate","artUrl", "word")],RToken[,c("artDate","artUrl", "word")])
head(data)## artDate artUrl word
## 1 2021/3/1 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html 魯蛇
## 2 2021/3/1 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html 家鄉
## 3 2021/3/1 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html 最近
## 4 2021/3/1 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html 停水
## 5 2021/3/1 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html 好
## 6 2021/3/1 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html 幾天
# 格式化日期欄位
data$artDate= data$artDate %>% as.Date("%Y/%m/%d")
# 過濾特殊字元
data_select = data %>%
filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
filter(nchar(.$word)>1)
# 算每天不同字的詞頻
# word_count:artDate,word,count
word_count <- data_select %>%
select(artDate,word) %>%
group_by(artDate,word) %>%
summarise(count=n()) %>% # 算字詞單篇總數用summarise
filter(count>3) %>% # 過濾出現太少次的字
arrange(desc(count))## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
P <- read_file("liwc/positive.txt") # 正向字典txt檔
N <- read_file("liwc/negative.txt") # 負向字典txt檔
#字典txt檔讀進來是一整個字串
typeof(P)## [1] "character"
# 將字串依,分割
# strsplit回傳list , 我們取出list中的第一個元素
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]
# 建立dataframe 有兩個欄位word,sentiments,word欄位內容是字典向量
P = data.frame(word = P, sentiment = "positive") #664
N = data.frame(word = N, sentiment = "negative") #1047
# 把兩個字典拼在一起
LIWC = rbind(P, N)
# 檢視字典
head(LIWC)## word sentiment
## 1 一流 positive
## 2 下定決心 positive
## 3 不拘小節 positive
## 4 不費力 positive
## 5 不錯 positive
## 6 主動 positive
# 算出每天的情緒總和
# sentiment_count:artDate,sentiment,count
sentiment_count = data_select %>%
select(artDate,word) %>%
inner_join(LIWC) %>%
group_by(artDate,sentiment) %>%
summarise(count=n()) ## Joining, by = "word"
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
# 3、4月正負情緒分數折線圖
sentiment_count %>%
ggplot()+
geom_line(aes(x=artDate,y=count,colour=sentiment))+
scale_x_date(labels = date_format("%m/%d"),
limits = as.Date(c('2021-03-01','2021-04-30'))
)## Warning: Removed 72 row(s) containing missing values (geom_path).
折線圖中可以看到3、4月這段時間當中正負面情緒是差不多的,在三月初到四月中旬幾乎都是負面情緒大於正面情緒,而在四月底則出現了正面情緒大於負面情緒。
# 5、6月正負情緒分數折線圖
sentiment_count %>%
ggplot()+
geom_line(aes(x=artDate,y=count,colour=sentiment))+
scale_x_date(labels = date_format("%m/%d"),
limits = as.Date(c('2021-05-01','2021-06-07'))
)+
# # 加上標示日期的線
geom_vline(aes(xintercept = as.numeric(artDate[which(sentiment_count$artDate == as.Date('2021-05-24'))
[1]])),colour = "#eb175b", linetype="dotdash")## Warning: Removed 120 row(s) containing missing values (geom_path).
而在5、6月的情緒折線圖中,可以看到五月中旬以前負面的情緒突然間拉大許多,而在圖中標示的紅色線段5/24突然正面情緒大於負面情緒,接著後面又回復到正負面情緒差不多的情況。
M34 <- post %>%
filter(artDate <= as.Date('2021-04-30'))
M56 <- post %>%
filter(between(artDate, as.Date('2021-05-01'), as.Date('2021-06-07')))## Warning in between(artDate, as.Date("2021-05-01"), as.Date("2021-06-07")): 強制
## 變更過程中產生了 NA
# # 文章斷句("\n\n"取代成"。")
# mask_meta <- M34 %>%
# mutate(sentence=gsub("[\n]{2,}", "。", sentence))
#
# # 以全形或半形 驚歎號、問號、分號 以及 全形句號 爲依據進行斷句
# mask_sentences <- strsplit(mask_meta$sentence,"[。!;?!?;]")
#
# # 將每句句子,與他所屬的文章連結配對起來,整理成一個dataframe
# mask_sentences <- data.frame(
# artUrl = rep(mask_meta$artUrl, sapply(mask_sentences, length)),
# sentence = unlist(mask_sentences)
# ) %>%
# filter(!str_detect(sentence, regex("^(\t|\n| )*$")))
# # 如果有\t或\n就去掉
#
# mask_sentences$sentence <- as.character(mask_sentences$sentence)
# mask_sentences# ## 文章斷詞
# # load mask_lexicon(特定要斷開的詞,像是user_dict)
# mask_lexicon <- scan(file = "dict/mask_lexicon.txt", what=character(),sep='\n',
# encoding='utf-8',fileEncoding='utf-8')
# # load stop words
# stop_words <- scan(file = "dict/stop_words.txt", what=character(),sep='\n',
# encoding='utf-8',fileEncoding='utf-8')
#
# # 使用默認參數初始化一個斷詞引擎
# jieba_tokenizer = worker()
#
# # 使用口罩字典重新斷詞
# new_user_word(jieba_tokenizer, c(mask_lexicon))
#
# # tokenize function
# chi_tokenizer <- function(t) {
# lapply(t, function(x) {
# if(nchar(x)>1){
# tokens <- segment(x, jieba_tokenizer)
# tokens <- tokens[!tokens %in% stop_words]
# # 去掉字串長度爲1的詞彙
# tokens <- tokens[nchar(tokens)>1]
# return(tokens)
# }
# })
# }
#
#
# # 用剛剛初始化的斷詞器把sentence斷開
# tokens <- mask_sentences %>%
# mutate(sentence = gsub("[[:punct:]]", "",sentence)) %>%
# mutate(sentence = gsub("[0-9a-zA-Z]", "",sentence)) %>%
# unnest_tokens(word, sentence, token=chi_tokenizer) %>%
# count(artUrl, word) %>% # 計算每篇文章出現的字頻
# rename(count=n)
# tokens
# save.image(file = "token_result1.rdata")load("token_result.rdata")
# load("token_result2.rdata")。根據詞頻,選擇只出現3字以上的字 。整理成url,word,n的格式之後,就可以轉dtm
freq = 3
# 依據字頻挑字
reserved_word <- tokens %>%
group_by(word) %>%
count() %>%
filter(n > freq) %>%
unlist()
water_removed <- tokens %>%
filter(word %in% reserved_word)
#water_dtm 裡面 nrow:幾篇文章 ; ncol:幾個字
water_dtm <- water_removed %>% cast_dtm(artUrl, word, count) 嘗試2、3、4、6、10、15個主題數,將結果存起來,再做進一步分析。 已經將跑完的檔案存成ldas_result.rdata,可以直接載入
# ldas = c()
# topics = c(2,3,4,6,10,15)
# for(topic in topics){
# start_time <- Sys.time()
# lda <- LDA(water_dtm, k = topic, control = list(seed = 2021))
# ldas =c(ldas,lda)
# print(paste(topic ,paste("topic(s) and use time is ", Sys.time() -start_time)))
# save(ldas,file = "ldas_result.rdata") # 將模型輸出成檔案
# }load("ldas_result.rdata")topics = c(2,3,4,6,10,15)
data_frame(k = topics, perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
ggplot(aes(k, perplex)) +
geom_point() +
geom_line() +
labs(title = "Evaluating LDA topic models",
subtitle = "Optimal number of topics (smaller is better)",
x = "Number of topics",
y = "Perplexity")## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
將剛處理好的dtm放入LDA函式分析
# LDA分成3個主題
water_lda1 <- LDA(water_dtm, k = 3, control = list(seed = 15))removed_word = c('不是', '大家', '沒有')
# 看各群的常用詞彙
tidy(water_lda1, matrix = "beta") %>% # 取出topic term beta值
filter(! term %in% removed_word) %>%
group_by(topic) %>%
top_n(10, beta) %>% # beta值前10的字
ungroup() %>%
mutate(topic = as.factor(topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()從topic中的代表詞可以歸納出:
topic 1 = “台灣水庫的用水量”
topic 2 = “討論台灣各地的缺水問題”
topic 3 = “與水情相關的新聞報導”
topics_name = c("台灣水庫的用水量","討論台灣各地的缺水問題","與水情相關的新聞報導")每篇文章拿gamma值最大的topic當該文章的topic
# 在tidy function中使用參數"gamma"來取得 theta矩陣
water_topic1 <- tidy(water_lda1, matrix="gamma") %>% # document topic gamma
group_by(document) %>%
top_n(1, wt=gamma)
post_topic1 <- merge(x = post, y = water_topic1, by.x = "artUrl", by.y="document")post_topic1 %>%
mutate(artDate = as.Date(artDate)) %>%
group_by(artDate,topic) %>%
summarise(sum =sum(topic)) %>%
ggplot(aes(x= artDate,y=sum,fill=as.factor(topic))) +
geom_col(position="fill") ## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
post_topic1 %>%
group_by(artCat,topic) %>%
summarise(sum = n()) %>%
ggplot(aes(x= artCat,y=sum,fill=as.factor(topic))) +
geom_col(position="dodge") ## `summarise()` has grouped output by 'artCat'. You can override using the `.groups` argument.
畫出每天topic的分布,可以發現第二個主題(討論各地區缺水問題)的討論度較其他兩者都高,推測是因為這段期間缺水非常嚴重造成大家熱烈的討論。
# load("token_result.rdata")
load("token_result2.rdata")。根據詞頻,選擇只出現3字以上的字 。整理成url,word,n的格式之後,就可以轉dtm
freq = 3
# 依據字頻挑字
reserved_word <- tokens %>%
group_by(word) %>%
count() %>%
filter(n > freq) %>%
unlist()
water_removed <- tokens %>%
filter(word %in% reserved_word)
#water_dtm 裡面 nrow:幾篇文章 ; ncol:幾個字
water_dtm <- water_removed %>% cast_dtm(artUrl, word, count) 嘗試2、3、4、6、10、15個主題數,將結果存起來,再做進一步分析。 將跑完的檔案存成ldas_result.rdata,可以直接載入
# ldas = c()
# topics = c(2,3,4,6,10,15)
# for(topic in topics){
# start_time <- Sys.time()
# lda <- LDA(water_dtm, k = topic, control = list(seed = 2021))
# ldas =c(ldas,lda)
# print(paste(topic ,paste("topic(s) and use time is ", Sys.time() -start_time)))
# save(ldas,file = "ldas_result2.rdata") # 將模型輸出成檔案
# }load("ldas_result2.rdata")topics = c(2,3,4,6,10,15)
data_frame(k = topics, perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
ggplot(aes(k, perplex)) +
geom_point() +
geom_line() +
labs(title = "Evaluating LDA topic models",
subtitle = "Optimal number of topics (smaller is better)",
x = "Number of topics",
y = "Perplexity")選定4個主題數建立5、6月的主題模型
將剛處理好的dtm放入LDA函式分析
# LDA分成4個主題
water_lda2 <- LDA(water_dtm, k = 4, control = list(seed = 15))removed_word = c('如題', '不會', '不是', '大家', '沒有', '看到')
# 看各群的常用詞彙
tidy(water_lda2, matrix = "beta") %>% # 取出topic term beta值
filter(! term %in% removed_word) %>%
group_by(topic) %>%
top_n(10, beta) %>% # beta值前10的字
ungroup() %>%
mutate(topic = as.factor(topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()從topic中的代表詞可以歸納出:
topic 1 = “各水庫水情狀況”
topic 2 = “與水情、水庫相關的新聞報導”
topic 3 = “討論台灣各地的水情問題”
topic 4 = “討論政府對於近期疫情、停電與水情的作為”
topics_name2 = c("各水庫水情狀況","與水情、水庫相關的新聞報導","討論台灣各地的水情問題","討論政府對於近期疫情、停電與水情的作為")每篇文章拿gamma值最大的topic當該文章的topic
# 在tidy function中使用參數"gamma"來取得 theta矩陣
water_topic2 <- tidy(water_lda2, matrix="gamma") %>% # document topic gamma
group_by(document) %>%
top_n(1, wt=gamma)
post_topic2 <- merge(x = post, y = water_topic2, by.x = "artUrl", by.y="document")post_topic2 %>%
mutate(artDate = as.Date(artDate)) %>%
group_by(artDate,topic) %>%
summarise(sum =sum(topic)) %>%
ggplot(aes(x= artDate,y=sum,fill=as.factor(topic))) +
geom_col(position="fill") ## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
post_topic2 %>%
group_by(artCat,topic) %>%
summarise(sum = n()) %>%
ggplot(aes(x= artCat,y=sum,fill=as.factor(topic))) +
geom_col(position="dodge") ## `summarise()` has grouped output by 'artCat'. You can override using the `.groups` argument.
畫出每天topic的分布,可以發現第三、四個主題(討論台灣各地的水情問題、討論政府對於近期疫情、停電與水情的作為)的討論度較其他兩者都高,推測是因為雖然在4、5月有降雨使水情好轉,但是搭配疫情和限電的恐慌仍然造成了較高的討論度。
post <- read.csv("PTT下雨_articleMetaData.csv")
review <- read.csv("PTT下雨_articleReviews.csv")# 文章和留言
review <- review %>%
select(artUrl, cmtPoster, cmtStatus, cmtContent)
post_Review <- merge(x = post, y = review, by = "artUrl")
# 把文章和topic
post_Review <- merge(x = post_Review, y = water_topic1, by.x = "artUrl", by.y="document")
head(post_Review,3)## artUrl
## 1 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html
## 2 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html
## 3 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html
## artTitle artDate artTime artPoster
## 1 [問卦]台灣水情吃緊跟鳳梨外銷中國誰比較重要! 2021/3/1 13:45:31 wan5389
## 2 [問卦]台灣水情吃緊跟鳳梨外銷中國誰比較重要! 2021/3/1 13:45:31 wan5389
## 3 [問卦]台灣水情吃緊跟鳳梨外銷中國誰比較重要! 2021/3/1 13:45:31 wan5389
## artCat commentNum push boo
## 1 Gossiping 27 15 2
## 2 Gossiping 27 15 2
## 3 Gossiping 27 15 2
## sentence
## 1 魯蛇家鄉最近停水好幾天引起民怨!\n就查了最近的水情資料!\nhttps://i.imgur.com/rUN53pK.jpg\n發現最近新竹以南水情不是很樂觀!\n卻沒有中央官員出來呼籲要怎麼解決缺水問題!\n反觀\n鳳梨被中國禁止進口,\nhttps://i.imgur.com/zXsr4Vy.jpg\n中央從總統到行政院長各級長官,\n都站出來呼籲解決鳳梨問題!\n\n所以現在臺灣鳳梨比臺灣缺水誰重要?\n\n沒有喝水就喝鳳梨汁是不錯的解決方案吧?\n是否有專板\n本板並非萬能問板\n兩則\n本看板嚴格禁止政治問卦\n未滿30繁體中文字水桶3個月,嚴重者以鬧板論
## 2 魯蛇家鄉最近停水好幾天引起民怨!\n就查了最近的水情資料!\nhttps://i.imgur.com/rUN53pK.jpg\n發現最近新竹以南水情不是很樂觀!\n卻沒有中央官員出來呼籲要怎麼解決缺水問題!\n反觀\n鳳梨被中國禁止進口,\nhttps://i.imgur.com/zXsr4Vy.jpg\n中央從總統到行政院長各級長官,\n都站出來呼籲解決鳳梨問題!\n\n所以現在臺灣鳳梨比臺灣缺水誰重要?\n\n沒有喝水就喝鳳梨汁是不錯的解決方案吧?\n是否有專板\n本板並非萬能問板\n兩則\n本看板嚴格禁止政治問卦\n未滿30繁體中文字水桶3個月,嚴重者以鬧板論
## 3 魯蛇家鄉最近停水好幾天引起民怨!\n就查了最近的水情資料!\nhttps://i.imgur.com/rUN53pK.jpg\n發現最近新竹以南水情不是很樂觀!\n卻沒有中央官員出來呼籲要怎麼解決缺水問題!\n反觀\n鳳梨被中國禁止進口,\nhttps://i.imgur.com/zXsr4Vy.jpg\n中央從總統到行政院長各級長官,\n都站出來呼籲解決鳳梨問題!\n\n所以現在臺灣鳳梨比臺灣缺水誰重要?\n\n沒有喝水就喝鳳梨汁是不錯的解決方案吧?\n是否有專板\n本板並非萬能問板\n兩則\n本看板嚴格禁止政治問卦\n未滿30繁體中文字水桶3個月,嚴重者以鬧板論
## cmtPoster cmtStatus cmtContent topic
## 1 kight 推 :水問題很大,但能怎麼辦?招喚水神龍出來降雨嗎? 3
## 2 ytcytc 推 :鳳梨過多可以送去做鳳梨酒阿 3
## 3 grayoasis → :綠畜帶的風向啊 3
## gamma
## 1 0.5664226
## 2 0.5664226
## 3 0.5664226
link <- post_Review %>% select(cmtPoster, artPoster, artUrl)
head(link,3)## cmtPoster artPoster artUrl
## 1 kight wan5389 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html
## 2 ytcytc wan5389 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html
## 3 grayoasis wan5389 https://www.ptt.cc/bbs/Gossiping/M.1614606334.A.2DD.html
reviewNetwork <- graph_from_data_frame(d=link, directed=T)
reviewNetwork## IGRAPH a20a462 DN-- 9774 31219 --
## + attr: name (v/c), artUrl (e/c)
## + edges from a20a462 (vertex names):
## [1] kight ->wan5389 ytcytc ->wan5389 grayoasis ->wan5389
## [4] fp737 ->wan5389 newsyho ->wan5389 rave760422 ->wan5389
## [7] loveadu ->wan5389 alex00089 ->wan5389 Pittsburgh ->wan5389
## [10] Subliminal ->wan5389 Gallardo ->wan5389 laihom0808 ->wan5389
## [13] milk7054 ->wan5389 lyzn ->wan5389 kinghtt ->wan5389
## [16] bathilda ->wan5389 koehie ->wan5389 riker729 ->wan5389
## [19] riker729 ->wan5389 koehie ->wan5389 bathilda ->wan5389
## [22] niko0202 ->wan5389 eett811025 ->wan5389 Subliminal ->wan5389
## + ... omitted several edges
直接畫的話,因為點沒有經過篩選,看起來會密密麻麻的。所以這邊會先經過篩選之後再呈現。
資料篩選的方式:
# 看一下留言數大概都多少(方便後面篩選)
M34 %>%
# filter(commentNum<100) %>%
ggplot(aes(x=commentNum)) + geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
發現留言次數幾乎都是集中在100則以下。
length(unique(post_Review$artPoster)) # 發文者數量 605## [1] 605
length(unique(post_Review$cmtPoster)) # 回覆者數量 9469## [1] 9469
allPoster <- c(post_Review$artPoster, post_Review$cmtPoster) # 總參與人數 9774
length(unique(allPoster))## [1] 9774
標記所有出現過得使用者
userList <- data.frame(user=unique(allPoster)) %>%
mutate(type=ifelse(user%in%posts$artPoster, "poster", "replyer"))
head(userList,3)## user type
## 1 wan5389 poster
## 2 q6534gas1793 poster
## 3 A80211ab poster
事件是04/13附近討論度最高,我們挑出當天的文章和回覆看看
link <- post_Review %>%
group_by(cmtPoster, artUrl) %>%
filter(n()>1) %>%
filter(commentNum > 50) %>%
filter(artCat=="Gossiping") %>%
filter(artDate == as.Date('2021-04-13')) %>%
select(cmtPoster, artPoster, artUrl) %>%
unique()
link## # A tibble: 55 x 3
## # Groups: cmtPoster, artUrl [55]
## cmtPoster artPoster artUrl
## <chr> <chr> <chr>
## 1 StylishTrade olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## 2 justice2008 olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## 3 Lailungsheng olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## 4 s8800892000 olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## 5 Nevhir olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## 6 waijr olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## 7 piyobearman olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## 8 turbomons olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## 9 lulocke olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## 10 masoho olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A.B08.h~
## # ... with 45 more rows
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
head(filtered_user,3)## user type
## 1 kight replyer
## 2 koehie replyer
## 3 jay0215 replyer
用使用者的身份來區分點的顏色
filter_degree = 10
set.seed(123)
# 設定 node 的 label/ color
reviewNetwork = degree(reviewNetwork) > 2
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
labels <- degree(reviewNetwork) # 算出每個點的degree
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
plot(
reviewNetwork,
vertex.size=3,
edge.width=3,
vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)我們可以看到基本的使用者關係,但是我們希望能夠將更進階的資訊視覺化。
挑選出2021-04-13當天的文章, 篩選一篇文章回覆1次以上者,且文章留言數多於50則, 文章主題歸類為1(台灣水庫的用水量)與2(討論台灣各地的缺水問題)者, 欄位只取:cmtPoster(評論者), artPoster(發文者), artUrl(文章連結), topic(主題)
link <- post_Review %>%
group_by(cmtPoster, artUrl) %>%
filter(n()>1) %>%
filter(commentNum > 50) %>%
filter(artCat=="Gossiping") %>% #HatePolitics / Gossiping
filter(artDate == as.Date('2021-04-13')) %>%
filter(topic == 1 | topic == 2) %>%
select(cmtPoster, artPoster, artUrl, topic) %>%
unique()
link## # A tibble: 55 x 4
## # Groups: cmtPoster, artUrl [55]
## cmtPoster artPoster artUrl topic
## <chr> <chr> <chr> <int>
## 1 StylishTrade olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## 2 justice2008 olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## 3 Lailungsheng olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## 4 s8800892000 olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## 5 Nevhir olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## 6 waijr olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## 7 piyobearman olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## 8 turbomons olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## 9 lulocke olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## 10 masoho olmtw https://www.ptt.cc/bbs/Gossiping/M.1618291171.A~ 1
## # ... with 45 more rows
抓nodes
在所有的使用者裡面,篩選link中有出現的使用者
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
head(filtered_user,3)## user type
## 1 kight replyer
## 2 koehie replyer
## 3 jay0215 replyer
filter_degree = 10
# 建立網路關係
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
# 依據使用者身份對點進行上色
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
# 依據回覆發生的文章所對應的主題,對他們的關聯線進行上色
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$topic == "2", "palevioletred", "lightgreen")
# 畫出社群網路圖(degree>7的才畫出來)
set.seed(5432)
plot(reviewNetwork, vertex.size=3, edge.width=3, vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)
# 加入標示
legend("bottomright", c("發文者","回文者"), pch=21,
col="#777777", pt.bg=c("gold","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("台灣水庫的用水量","討論台灣各地的缺水問題"),
col=c("palevioletred", "lightgreen"), lty=1, cex=0.7)我們可以看到基本的發文者與回覆者之間的關係,並發現與第二個主題’討論台灣各地的缺水問題’中的文章像是缺水如何影響台積電、水庫賣水是否再接下來會導致缺水問題浮現都非常符合結果,那與台灣水庫的用水量相關的那篇文章的內容是有點嘲諷口吻的像是’不是已經辦過祈雨法說會了?怎麼還沒下雨’,不太符合主題內容但本組認為是因為它的內容有趣導致多人回應互動才會浮現。
filter_degree = 7 # 使用者degree
# 過濾留言者對發文者的推噓程度
link <- post_Review %>%
filter(artCat=="Gossiping") %>%
filter(commentNum > 100) %>%
filter(cmtStatus!="→") %>%
group_by(cmtPoster, artUrl) %>%
filter( n() > 2) %>%
ungroup() %>%
select(cmtPoster, artPoster, artUrl, cmtStatus) %>%
unique()
# 接下來把網路圖畫出來,跟前面做的事都一樣,因此不再細述
# 篩選link中有出現的使用者
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
# 建立網路關係
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
# 依據使用者身份對點進行上色
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
# 依據回覆發生的文章所對應的主題,對他們的關聯線進行上色
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$cmtStatus == "推", "lightgreen", "palevioletred")
# 畫出社群網路圖
set.seed(5432)
plot(reviewNetwork, vertex.size=2, edge.width=3, vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)
# 加入標示
legend("bottomright", c("發文者","回文者"), pch=21,
col="#777777", pt.bg=c("gold","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("推","噓"),
col=c("lightgreen","palevioletred"), lty=1, cex=1)基本上推多於噓,一一檢視文章後發現 - > 三篇文章回覆多為諷刺或是按推文旦內容以嘲諷居多,所以可推斷其實主要情緒是負向的
# 文章和留言
review <- review %>%
select(artUrl, cmtPoster, cmtStatus, cmtContent)
post_Review <- merge(x = post, y = review, by = "artUrl")
# 把文章和topic
post_Review <- merge(x = post_Review, y = water_topic2, by.x = "artUrl", by.y="document")
head(post_Review,3)## artUrl
## 1 https://www.ptt.cc/bbs/Gossiping/M.1619938689.A.B41.html
## 2 https://www.ptt.cc/bbs/Gossiping/M.1619938689.A.B41.html
## 3 https://www.ptt.cc/bbs/Gossiping/M.1619938689.A.B41.html
## artTitle artDate artTime artPoster artCat
## 1 [問卦]缺水是不是安全下莊了! 2021/5/2 06:58:07 mikepopkimo Gossiping
## 2 [問卦]缺水是不是安全下莊了! 2021/5/2 06:58:07 mikepopkimo Gossiping
## 3 [問卦]缺水是不是安全下莊了! 2021/5/2 06:58:07 mikepopkimo Gossiping
## commentNum push boo
## 1 13 6 0
## 2 13 6 0
## 3 13 6 0
## sentence
## 1 這幾天綠色染黑染毒,新冠又炸裂,這些信息流來得又快又多,讓呆丸郎應接不暇,無法\n喘息,想到前幾天多多少少還有下點雨,這缺水問題也該pass了吧!\n
## 2 這幾天綠色染黑染毒,新冠又炸裂,這些信息流來得又快又多,讓呆丸郎應接不暇,無法\n喘息,想到前幾天多多少少還有下點雨,這缺水問題也該pass了吧!\n
## 3 這幾天綠色染黑染毒,新冠又炸裂,這些信息流來得又快又多,讓呆丸郎應接不暇,無法\n喘息,想到前幾天多多少少還有下點雨,這缺水問題也該pass了吧!\n
## cmtPoster cmtStatus cmtContent topic gamma
## 1 callhek → :台灣哪裡缺水 3 0.9547329
## 2 ting701 推 :缺肥水 3 0.9547329
## 3 a0953781935 推 :我台北人,南部怎樣乾我屁事 3 0.9547329
link <- post_Review %>% select(cmtPoster, artPoster, artUrl)
head(link,3)## cmtPoster artPoster
## 1 callhek mikepopkimo
## 2 ting701 mikepopkimo
## 3 a0953781935 mikepopkimo
## artUrl
## 1 https://www.ptt.cc/bbs/Gossiping/M.1619938689.A.B41.html
## 2 https://www.ptt.cc/bbs/Gossiping/M.1619938689.A.B41.html
## 3 https://www.ptt.cc/bbs/Gossiping/M.1619938689.A.B41.html
reviewNetwork <- graph_from_data_frame(d=link, directed=T)
reviewNetwork## IGRAPH a47c1cb DN-- 10148 24581 --
## + attr: name (v/c), artUrl (e/c)
## + edges from a47c1cb (vertex names):
## [1] callhek ->mikepopkimo ting701 ->mikepopkimo a0953781935->mikepopkimo
## [4] freeclouds ->mikepopkimo mirza ->mikepopkimo mirza ->mikepopkimo
## [7] ilove640 ->mikepopkimo neoa01 ->mikepopkimo cucusow ->mikepopkimo
## [10] cucusow ->mikepopkimo empoluvvivi->mikepopkimo cat5672 ->mikepopkimo
## [13] kkwwee ->mikepopkimo xxx60133 ->mystage winit ->mystage
## [16] kcclasaki ->mystage a1005100 ->mystage King5566 ->mystage
## [19] kichyo ->mystage cwh0105 ->mystage Ed860227 ->mystage
## [22] ev331 ->mystage cat5672 ->mystage ev331 ->mystage
## + ... omitted several edges
直接畫的話,因為點沒有經過篩選,看起來會密密麻麻的。所以這邊會先經過篩選之後再呈現。
資料篩選的方式:
# 看一下留言數大概都多少(方便後面篩選)
M56 %>%
# filter(commentNum<100) %>%
ggplot(aes(x=commentNum)) + geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 38 rows containing non-finite values (stat_bin).
發現留言次數幾乎都是集中在100則以下。
length(unique(post_Review$artPoster)) # 發文者數量 605## [1] 537
length(unique(post_Review$cmtPoster)) # 回覆者數量 9469## [1] 9866
allPoster <- c(post_Review$artPoster, post_Review$cmtPoster) # 總參與人數 9774
length(unique(allPoster))## [1] 10148
標記所有出現過得使用者
userList <- data.frame(user=unique(allPoster)) %>%
mutate(type=ifelse(user%in%posts$artPoster, "poster", "replyer"))
head(userList,3)## user type
## 1 mikepopkimo poster
## 2 mystage poster
## 3 TWOOOOOOOOOO poster
事件是5/13附近討論度最高,我們挑出當天的文章和回覆看看
link <- post_Review %>%
group_by(cmtPoster, artUrl) %>%
filter(n()>1) %>%
filter(commentNum > 50) %>%
filter(artCat=="Gossiping") %>%
filter(artDate == as.Date('2021-05-13')) %>%
select(cmtPoster, artPoster, artUrl) %>%
unique()
link## # A tibble: 62 x 3
## # Groups: cmtPoster, artUrl [62]
## cmtPoster artPoster artUrl
## <chr> <chr> <chr>
## 1 bee5408 joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862.A.D64~
## 2 ayabehaori joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862.A.D64~
## 3 vvvvaaaa joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862.A.D64~
## 4 pass9487 joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862.A.D64~
## 5 kimpink joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862.A.D64~
## 6 a1121210 tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158.A.EA1~
## 7 SS913034 tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158.A.EA1~
## 8 forhorde5566 tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158.A.EA1~
## 9 okitawawa tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158.A.EA1~
## 10 frankochris tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158.A.EA1~
## # ... with 52 more rows
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
head(filtered_user,3)## user type
## 1 kissa0924307 replyer
## 2 spring53287 replyer
## 3 ARUSHI replyer
用使用者的身份來區分點的顏色
filter_degree = 10
set.seed(123)
# 設定 node 的 label/ color
reviewNetwork = degree(reviewNetwork) > 2
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
labels <- degree(reviewNetwork) # 算出每個點的degree
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
plot(
reviewNetwork,
vertex.size=3,
edge.width=3,
vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)我們可以看到基本的使用者關係,但是我們希望能夠將更進階的資訊視覺化。
例如:使用者經常參與的文章種類,或是使用者在該社群網路中是否受到歡迎。
挑選出2021-05-13當天的文章, 篩選一篇文章回覆0次以上者,且文章留言數多餘50則, 文章主題歸類為1(對下雨後的正向討論)與3(缺水的各種解決方法)者, 欄位只取:cmtPoster(評論者), artPoster(發文者), artUrl(文章連結), topic(主題)
link <- post_Review %>%
group_by(cmtPoster, artUrl) %>%
filter(n()>1) %>%
filter(commentNum > 50) %>%
filter(artCat=="Gossiping") %>% #HatePolitics / Gossiping
filter(artDate == as.Date('2021-05-13')) %>%
filter(topic == 3 | topic == 4) %>%
select(cmtPoster, artPoster, artUrl, topic) %>%
unique()
link## # A tibble: 60 x 4
## # Groups: cmtPoster, artUrl [60]
## cmtPoster artPoster artUrl topic
## <chr> <chr> <chr> <int>
## 1 bee5408 joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862~ 3
## 2 ayabehaori joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862~ 3
## 3 vvvvaaaa joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862~ 3
## 4 pass9487 joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862~ 3
## 5 kimpink joj4211 https://www.ptt.cc/bbs/Gossiping/M.1620889862~ 3
## 6 a1121210 tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158~ 3
## 7 SS913034 tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158~ 3
## 8 forhorde5566 tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158~ 3
## 9 okitawawa tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158~ 3
## 10 frankochris tommy910174 https://www.ptt.cc/bbs/Gossiping/M.1620893158~ 3
## # ... with 50 more rows
抓nodes
在所有的使用者裡面,篩選link中有出現的使用者
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
head(filtered_user,3)## user type
## 1 kissa0924307 replyer
## 2 spring53287 replyer
## 3 ARUSHI replyer
filter_degree = 10
# 建立網路關係
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
# 依據使用者身份對點進行上色
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
# 依據回覆發生的文章所對應的主題,對他們的關聯線進行上色
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$topic == "2", "palevioletred", "lightgreen")
# 畫出社群網路圖(degree>7的才畫出來)
set.seed(5432)
plot(reviewNetwork, vertex.size=3, edge.width=3, vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) >= filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)
# 加入標示
legend("bottomright", c("發文者","回文者"), pch=21,
col="#777777", pt.bg=c("gold","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("討論台灣各地的水情問題","討論政府對於近期疫情、停電與水情的作為"),
col=c("palevioletred", "lightgreen"), lty=1, cex=0.7)我們可以看到基本的發文者與回覆者之間的關係,並發現與第四個主題’討論政府對於近期疫情、停電與水情的作為’中的文章像是缺水、缺電,台灣是不是被文組害死和韓國瑜對於疫情、缺水、缺電的看法都非常符合結果。
PTT的回覆有三種,推文、噓文、箭頭,我們只要看推噓就好,因此把箭頭清掉,這樣資料筆數較少,所以我們把篩選的條件放寬一些。
filter_degree = 7 # 使用者degree
# 過濾留言者對發文者的推噓程度
link <- post_Review %>%
filter(artCat=="Gossiping") %>%
filter(commentNum > 50) %>%
filter(cmtStatus!="→") %>%
group_by(cmtPoster, artUrl) %>%
filter( n() > 2) %>%
ungroup() %>%
select(cmtPoster, artPoster, artUrl, cmtStatus) %>%
unique()
# 接下來把網路圖畫出來,跟前面做的事都一樣,因此不再細述
# 篩選link中有出現的使用者
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
# 建立網路關係
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
# 依據使用者身份對點進行上色
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
# 依據回覆發生的文章所對應的主題,對他們的關聯線進行上色
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$cmtStatus == "推", "lightgreen", "palevioletred")
# 畫出社群網路圖
set.seed(5432)
plot(reviewNetwork, vertex.size=2, edge.width=3, vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)
# 加入標示
legend("bottomright", c("發文者","回文者"), pch=21,
col="#777777", pt.bg=c("gold","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("推","噓"),
col=c("lightgreen","palevioletred"), lty=1, cex=1)兩邊都涵蓋水庫水量、缺水還有一些報導的主題,前者主要重點在探討缺水以及執政黨的應對方式,後者則增加了對於限電、疫情的相關討論。
3、4月的結果以台灣各地缺水問題為主,推大於噓,但其實應該相反,因為較多反諷。 5、6月的結果以討論政府對於近期疫情、停電與水情的作為,推大於噓。