We scrape the articles and comments of the PTT Gossiping board through a text analysis platform and use R packages to organize and analyze recent events in Taiwan.
Many things have happened in Taiwan since the beginning of the year, and netizens discuss these events intensely on online forums, which makes forums an important channel for gauging public opinion and compiling monthly and yearly event reviews. In this final project, we use the NSYSU College of Management text analysis platform (中山管院文字分析平台) to scrape text data from the PTT Gossiping board, aiming to review recent events and understand which issues Taiwanese netizens have mainly been discussing since the start of the year.
- Data source: all articles and comments on the PTT Gossiping board, crawled for 2021-02-28 ~ 2021-05-31
- Search keyword: 「新聞」 (news)
- Analysis period: 2021/03/01 ~ 2021/05/31
- Data size: 14,458 articles in total
- Note: we initially crawled from the start of the year (2021/01/01) through the end of May, but found that articles from January and February were hard to assign to coherent topics, presumably because the discussion was not focused on specific events or issues. We therefore restricted the analysis to early March through the end of May (see the sketch below).
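The restriction itself is a simple date filter. A minimal sketch, assuming the crawled table (loaded as MetaData below) carries the artDate column used throughout this report:
# keep only articles in the final analysis window, 2021-03-01 ~ 2021-05-31
MetaData <- MetaData %>%
  mutate(artDate = as.Date(artDate)) %>%
  filter(artDate >= as.Date("2021-03-01"), artDate <= as.Date("2021-05-31"))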
Load packages
library(readr)
require(data.table)
## Loading required package: data.table
require(ggplot2)
## Loading required package: ggplot2
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
##     between, first, last
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
require(jiebaR)
## Loading required package: jiebaR
## Loading required package: jiebaRD
require(tidyr)
## Loading required package: tidyr
require(tidytext)
## Loading required package: tidytext
require(stringr)
## Loading required package: stringr
require(tm)
## Loading required package: tm
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
##     annotate
require(topicmodels)
## Loading required package: topicmodels
require(purrr)
## Loading required package: purrr
##
## Attaching package: 'purrr'
## The following object is masked from 'package:data.table':
##
##     transpose
require(igraph)
## Loading required package: igraph
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:purrr':
##
##     compose, simplify
## The following object is masked from 'package:tidyr':
##
##     crossing
## The following objects are masked from 'package:dplyr':
##
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
##     decompose, spectrum
## The following object is masked from 'package:base':
##
##     union
require(reshape2)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
##     smiths
## The following objects are masked from 'package:data.table':
##
##     dcast, melt
require(wordcloud2)
## Loading required package: wordcloud2
require(widyr)
## Loading required package: widyr
require(ggraph)
## Loading required package: ggraph
require(networkD3)
## Loading required package: networkD3
Load and clean the data
# load the article and comment data
MetaData = fread('articleMetaData_article.csv', encoding = 'UTF-8')
Reviews = fread('articleReviews_comment.csv', encoding = 'UTF-8')
# keep only articles whose title contains [新聞]; 14,032 articles remain
keywords = c('\\[新聞\\]')
toMatch = paste(keywords, collapse="|")
MetaData = with(MetaData, MetaData[grepl(toMatch, artTitle),])
# strip boilerplate phrases shared by news posts and drop one problematic article (14,031 remain)
MetaData <- MetaData %>%
  mutate(sentence=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除", "", sentence)) %>%
  filter(artUrl!='https://www.ptt.cc/bbs/Gossiping/M.1618387122.A.35E.html')
# attach each article's comments
Reviews = left_join(MetaData, Reviews[,c("artUrl","cmtPoster", "cmtContent","cmtStatus")], by = "artUrl")
MetaData %>%
  mutate(artDate = as.Date(artDate)) %>%
  group_by(artDate) %>%
  summarise(count = n()) %>%
  ggplot(aes(artDate, count)) +
  geom_line(color="red") +
  geom_point() +
  labs(x="日期", y="文章數")
The figure above shows a clear jump in the number of articles in mid-May, most likely triggered by the outbreak of local COVID-19 cases in Taiwan. Next, we tokenize and transform the data.
Initialize the tokenization engine
jieba_tokenizer = worker(user="user_dict.txt", stop_word = "stop_words.txt")
ptt_tokenizer <- function(t) {
  lapply(t, function(x) {
    if(nchar(x)>1){
      tokens <- segment(x, jieba_tokenizer)
      # drop single-character tokens
      tokens <- tokens[nchar(tokens)>1]
      return(tokens)
    }
  })
}
Count each token's occurrences in each article and preview the data
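Before counting, it helps to sanity-check the tokenizer on a sample sentence (a minimal sketch; the exact segmentation depends on user_dict.txt and stop_words.txt):
# returns a list with one character vector of multi-character tokens
ptt_tokenizer("台灣本土疫情在五月中旬爆發")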
tokens <- MetaData %>%
  unnest_tokens(word, sentence, token=ptt_tokenizer) %>%
  # drop tokens containing digits or Latin letters, except "AZ" (the AstraZeneca vaccine)
  filter((!str_detect(word, regex("[0-9a-zA-Z]"))) | str_detect(word, regex("[Aa][Zz]"))) %>%
  count(artUrl, word) %>%
  rename(count=n)
tokens %>% head(20)
## artUrl word count
## 1: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 工作 1
## 2: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 不完 1
## 3: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 中央社 1
## 4: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 中國 1
## 5: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 之上 1
## 6: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 今天 1
## 7: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 引發 1
## 8: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 文說 1
## 9: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 日本 2
## 10: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 日電 1
## 11: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 水準 1
## 12: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 出口 1
## 13: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 出名 1
## 14: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 加拿大 1
## 15: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 台北 1
## 16: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 台灣 3
## 17: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 外銷 1
## 18: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 市場 1
## 19: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 平均 1
## 20: https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 打開 1
Convert the tokens to a Document-Term Matrix (DTM)
dtm <- tokens %>% cast_dtm(artUrl, word, count)
dtm
## <<DocumentTermMatrix (documents: 14031, terms: 157648)>>
## Non-/sparse entries: 1801129/2210157959
## Sparsity : 100%
## Maximal term length: 32
## Weighting : term frequency (tf)
inspect(dtm[1:10,1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 17/83
## Sparsity : 83%
## Maximal term length: 3
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs 工作 不完 中央社
## https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 1 1 1
## https://www.ptt.cc/bbs/Gossiping/M.1614528524.A.C4F.html 2 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530068.A.A6C.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530257.A.0EE.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530335.A.46B.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530925.A.B00.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614532433.A.146.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614535734.A.588.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614549250.A.303.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614552867.A.B43.html 0 0 0
## Terms
## Docs 中國 之上 今天 引發
## https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 1 1 1 1
## https://www.ptt.cc/bbs/Gossiping/M.1614528524.A.C4F.html 15 1 6 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530068.A.A6C.html 0 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530257.A.0EE.html 0 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530335.A.46B.html 0 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530925.A.B00.html 0 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614532433.A.146.html 0 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614535734.A.588.html 0 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614549250.A.303.html 0 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614552867.A.B43.html 0 0 0 0
## Terms
## Docs 文說 日本 日電
## https://www.ptt.cc/bbs/Gossiping/M.1614528401.A.FE4.html 1 2 1
## https://www.ptt.cc/bbs/Gossiping/M.1614528524.A.C4F.html 0 2 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530068.A.A6C.html 0 12 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530257.A.0EE.html 0 1 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530335.A.46B.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614530925.A.B00.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614532433.A.146.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614535734.A.588.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614549250.A.303.html 0 0 0
## https://www.ptt.cc/bbs/Gossiping/M.1614552867.A.B43.html 0 0 0
Fit topic models with different numbers of topics
# ldas = c()
# topics = c(4,6,8,10)
# for(topic in topics){
#   start_time <- Sys.time()
#   lda <- LDA(dtm, k = topic, control = list(seed = 2021))
#   ldas = c(ldas, lda)
#   print(paste(topic, paste("topic(s) and use time is ", Sys.time() - start_time)))
#   save(ldas, file = "ldas_result_2.rdata") # save the fitted models to a file
# }
Load the fitted LDA model for each number of topics
load("lda_result.rdata")
topics = c(4,6,8,10)
tibble(k = topics, perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
  ggplot(aes(k, perplex)) +
  geom_point() +
  geom_line() +
  labs(x = "主題數", y = "Perplexity")
Based on the perplexity plot above, we set the number of topics to six and use that LDA model to analyze the articles.
the_lda = ldas[[2]]  # with topics = c(4,6,8,10), ldas[[2]] is the k = 6 model
topics_words <- tidy(the_lda, matrix = "beta")
colnames(topics_words) <- c("topic", "term", "phi")
topics_words %>% arrange(desc(phi)) %>% head(10)
## # A tibble: 10 x 3
## topic term phi
## <int> <chr> <dbl>
## 1 3 疫苗 0.0272
## 2 2 中國 0.0161
## 3 1 新聞 0.0135
## 4 3 台灣 0.0130
## 5 6 完整 0.0129
## 6 6 新聞 0.0120
## 7 4 完整 0.0108
## 8 5 完整 0.0107
## 9 3 完整 0.0101
## 10 2 台灣 0.0100
topics_words %>%
  group_by(topic) %>%
  top_n(10, phi) %>%
  ungroup() %>%
  ggplot(aes(x = reorder_within(term,phi,topic), y = phi, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
# remove generic news-boilerplate terms before re-plotting
removed_word = c( "新聞","完整","備註","記者","來源","媒體","表示","報導","目前","新聞標題","署名","內文","網址","連結","現在","影響","指出","留言","當時","一名","知道","已經","發生","男子","發現","調查","最後","人員","去年","萬元","公司","業者","民眾","鳳梨" )
topics_words %>%
  filter(!term %in% removed_word) %>%
  group_by(topic) %>%
  top_n(10, phi) %>%
  ungroup() %>%
  ggplot(aes(x = reorder_within(term,phi,topic), y = phi, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
Based on the top terms above, we label the six topics: entertainment news (娛樂新聞), international news (國際新聞), vaccine-related news (疫苗相關新聞), workplace accidents (工安意外), pandemic-related news (疫情相關新聞), and social/criminal cases (社會案件):
topics_name = c("娛樂新聞","國際新聞","疫苗相關新聞","工安意外","疫情相關新聞","社會案件")
tmResult <- posterior(the_lda)
doc_pro <- tmResult$topics
document_topics <- doc_pro[MetaData$artUrl,]
document_topics_df = data.frame(document_topics)
colnames(document_topics_df) = topics_name
rownames(document_topics_df) = NULL
ptt_topic = cbind(MetaData, document_topics_df)
# drop the commentNum, push, and boo columns
ptt_topic$commentNum = NULL
ptt_topic$push = NULL
ptt_topic$boo = NULL
Sorting articles by their per-topic proportions shows what the highest-weight articles in each topic discuss. Taking 「娛樂新聞」 (entertainment news) and 「社會案件」 (social/criminal cases) as examples, the article titles (artTitle) show YouTubers and entertainment shows in the former, and civil disputes, assaults, and traffic accidents in the latter.
topic_1 <- ptt_topic %>%
  arrange(desc(`娛樂新聞`)) %>% head(10)
topic_1
topic_2 <- ptt_topic %>%
  arrange(desc(`社會案件`)) %>% head(10)
topic_2
ptt_topic %>%
  mutate(artDate = as.Date(artDate)) %>%
  filter(format(artDate,'%Y%m') > 202102) %>%
  group_by(artDate = format(artDate,'%Y%m')) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  melt(id.vars = "artDate") %>%
  ggplot(aes(x=artDate, y=value, fill=variable)) +
  geom_bar(stat = "identity") + labs(x="日期", y="文章數", fill="文章主題") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Consistent with the article-count plot in the first section, the month with the highest discussion volume is May 2021, and the dominant topics are pandemic-related. This confirms that the mid-May surge in posts on the Gossiping board was driven by the local COVID-19 outbreak in Taiwan.
ptt_topic %>%
mutate(artDate = as.Date(artDate)) %>%
filter(format(artDate,'%Y%m')> 202102)%>%
group_by(artDate = format(artDate,'%Y%m')) %>%
summarise_if(is.numeric, sum, na.rm = TRUE) %>%
melt(id.vars = "artDate")%>%
group_by(artDate)%>%
mutate(total_value =sum(value))%>%
ggplot( aes(x=artDate, y=value/total_value, fill=variable)) +
geom_bar(stat = "identity") + labs(x="日期",y="比例",fill="文章主題") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
- The proportions of entertainment news, international news, and social/criminal cases stayed fairly stable before the outbreak.
- The Taroko Express derailment in April 2021 and the frequent worker casualties in March and April generated many posts on the 「工安意外」 (workplace accident) topic.
- After the COVID-19 outbreak in May 2021, pandemic- and vaccine-related news took up a larger share.
Reference: http://text2vec.org/topic_modeling.html#latent_dirichlet_allocation
library(text2vec)
##
## Attaching package: 'text2vec'
## The following object is masked from 'package:igraph':
##
## normalize
## The following object is masked from 'package:topicmodels':
##
## perplexity
library(udpipe)
removed_word = c( "新聞","完整","備註","記者","來源","媒體","表示","報導","目前","新聞標題","署名","內文","網址","連結","現在","影響","指出","留言","當時","一名","知道","已經","發生","男子","發現","調查","最後","人員" )
tokens <- MetaData %>%
unnest_tokens(word, sentence, token=ptt_tokenizer) %>%
filter(!word %in% removed_word) %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]")) | str_detect(word, regex("[Aa][Zz]")))
dtf <- document_term_frequencies(tokens, document = "artUrl", term = "word")
dtm <- document_term_matrix(x = dtf)
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 30)
dim(dtm_clean)
## [1] 14031 11255
set.seed(2019)
topic_n = 5
lda_model = text2vec::LDA$new(n_topics = topic_n, doc_topic_prior = 0.1, topic_word_prior = 0.001)
doc_topic_distr = lda_model$fit_transform(dtm_clean, n_iter = 1000, convergence_tol = 1e-5, check_convergence_every_n = 100)
## INFO [01:49:19.150] early stopping at 350 iteration
## INFO [01:49:25.073] early stopping at 20 iteration
Compared with the topicmodels results: when we fit six topics here and inspect them with LDAvis, some topics overlap. One topic contains terms such as 「台灣經濟」 (Taiwan's economy) and 「政府補助」 (government subsidies), while another contains 「台灣疫苗接種」 (vaccination in Taiwan) and 「政府採購疫苗」 (government vaccine procurement). They likely overlap because both revolve around government-related keywords such as 「政府」 and the vaccine discussion, so we reduced the number of topics to five.
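Such overlap can also be checked numerically by comparing each pair of topics' top-word sets. A minimal sketch using the Jaccard index (top_words, jaccard, and overlap are helpers introduced here, not part of the original analysis):
# Jaccard overlap between every pair of topics' top-30 words;
# large off-diagonal values flag topics that largely repeat each other
top_words <- lda_model$get_top_words(n = 30, lambda = 0.5)
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
k <- ncol(top_words)
overlap <- outer(seq_len(k), seq_len(k),
                 Vectorize(function(i, j) jaccard(top_words[, i], top_words[, j])))
round(overlap, 2)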
With five topics, the contents of the topics separate cleanly, and no overlap appears.
lda_model$get_top_words(n = 10, lambda = 0.5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "疫情" "警方" "網友" "疫苗" "政府"
## [2,] "防疫" "現場" "真的" "中國" "公司"
## [3,] "確診" "分局" "民進黨" "台灣" "市府"
## [4,] "指揮中心" "警察" "覺得" "美國" "台鐵"
## [5,] "感染" "駕駛" "看到" "國家" "蘇貞昌"
## [6,] "醫院" "太魯閣" "國民黨" "接種" "業者"
## [7,] "口罩" "女子" "臉書" "大陸" "行政院"
## [8,] "隔離" "家屬" "事情" "中共" "工程"
## [9,] "本土" "機車" "影片" "日本" "柯文"
## [10,] "中央" "法官" "應該" "國際" "億元"
lda_model$plot()
## Loading required namespace: servr
Social network analysis
link <- Reviews %>%
  select(cmtPoster, artPoster, artUrl)
reviewNetwork <- graph_from_data_frame(d=link, directed=T)
## Warning in graph_from_data_frame(d = link, directed = T): In `d' `NA' elements
## were replaced with string "NA"
Count the number of posters (4,080)
length(unique(MetaData$artPoster))
## [1] 4080
Count the number of commenters (69,356)
# number of commenters: 69,356
length(unique(Reviews$cmtPoster))
## [1] 69356
Count the total number of participants (70,351)
# total number of participants
allPoster <- c(MetaData$artPoster, Reviews$cmtPoster)
length(unique(allPoster))
## [1] 70351
Compile a list of all users who appeared
userList <- data.frame(user = unique(allPoster)) %>%
  mutate(type = ifelse(user %in% MetaData$artPoster, "poster", "replyer"))
MetaData %>%
  filter(commentNum < 200) %>%
  ggplot(aes(x=commentNum)) + geom_histogram() + labs(x="留言數")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
PTT_poster = table(MetaData$artPoster) %>% sort %>% as.data.frame
colnames(PTT_poster) = c("artPoster","freq")
PTT_poster_fr = PTT_poster %>% filter(freq >= 25) # posters with at least 25 articles
link <- Reviews %>%
  filter(commentNum >= 100) %>% # articles with at least 100 comments
  filter(artPoster %in% PTT_poster_fr$artPoster) %>% # %in%, not ==, to match against all frequent posters
  select(cmtPoster, artPoster, artUrl) %>%
  unique()
Keep only users who appear in link
filtered_user <- userList %>%
  filter(user %in% link$cmtPoster | user %in% link$artPoster) %>%
  arrange(desc(type))
Remove edge directions from the graph
set.seed(487)
# drop the direction of the relationships to reduce visual clutter
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
plot(reviewNetwork, vertex.size=3, edge.arrow.size=.2, vertex.label=NA)
set.seed(487)
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
# color nodes by user role: red for posters, light blue for reply-only users
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "red", "lightblue")
# label user accounts with 90 or more connections
plot(reviewNetwork, vertex.size = 3, edge.arrow.size=.2,
     vertex.label=ifelse(degree(reviewNetwork) >= 90, V(reviewNetwork)$label, NA), vertex.label.font = 2)
From the network visualization, we take a closer look at the accounts ezJapan, Wojnarowski, kivan00, and hsnugear:
leader_data <- MetaData %>%
  filter((artPoster == "ezJapan")|(artPoster == "Wojnarowski")|(artPoster == "kivan00")|(artPoster == "hsnugear"))
leader_data$artDate = as.Date(leader_data$artDate)
leader_data = leader_data %>% mutate(months = as.Date(cut(artDate, "months")))
leader_data_month = leader_data %>% group_by(months, artPoster) %>%
  summarise(num=n()) %>% as.data.frame
## `summarise()` has grouped output by 'months'. You can override using the `.groups` argument.
leader_data_month %>% ggplot(aes(x= months, y=num, fill=artPoster)) + geom_bar(stat = "identity") +
  facet_wrap(~artPoster, ncol = 2, scales = "fixed") + labs(x="月份", y="發文頻率", fill="帳號")
Wojnarowski_data <- leader_data %>% filter(artPoster=="Wojnarowski")
Wojnarowski_sentence <- Wojnarowski_data %>%
select(artUrl,sentence)
Wojnarowski_sentence <-strsplit(Wojnarowski_sentence$sentence,"[。!;?!?;]")
# pair each sentence with its article URL and collect into a data frame
Wojnarowski_sentence <- data.frame(
artUrl = rep(Wojnarowski_data$artUrl, sapply(Wojnarowski_sentence, length)),
sentence = unlist(Wojnarowski_sentence)) %>%
filter(!str_detect(sentence, regex("^(\t|\n| )*$")))
Wojnarowski_sentence$sentence <- as.character(Wojnarowski_sentence$sentence)
# initialize the tokenizer with the user dictionary and stop words
jieba_tokenizer = worker(user="user_dict.txt", stop_word = "stop_words.txt")
ptt_tokenizer <- function(t) {
lapply(t, function(x) {
if(nchar(x)>1){
tokens <- segment(x, jieba_tokenizer)
tokens <- tokens[nchar(tokens)>1]
return(tokens)
}
})
}
# tokenize and count each word's occurrences per article
Wojnarowski_word <- Wojnarowski_sentence %>%
unnest_tokens(word, sentence, token=ptt_tokenizer) %>%
filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>%
count(artUrl, word, sort = TRUE)
removed_word = c( "新聞","完整","備註","記者","來源","媒體","表示","報導","目前","新聞標題","署名","內文","網址","連結","現在","影響","指出","留言","當時","一名","知道","已經","發生","男子","發現","調查","最後","人員","網友","看到","示意圖","圖片","一堆","完全","應該" )
# The word cloud conflicts with the networkD3 package,
# so a static image was used instead; the code is kept for reference.
# Wojnarowski_word %>%
#   filter(!word %in% removed_word) %>%
#   group_by(word) %>%
#   summarise(sum = n()) %>%
#   filter(sum > 3) %>%
#   arrange(desc(sum)) %>%
#   wordcloud2()
Wojnarowski_word_cors <- Wojnarowski_word %>%
group_by(word) %>%
filter(n() >= 5) %>%
  pairwise_cor(word, artUrl, sort = TRUE)
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
Wojnarowski_word_cors %>%
filter(correlation > 0.3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 3) +
geom_node_text(aes(label = name), repel = TRUE, family='STXihei') +
  theme_void()
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
The types of news each opinion leader tends to post:
filter_degree = 3 # degree threshold for labeling users
# measure commenters' push/boo behavior toward the posters
link <- Reviews %>%
filter((artPoster == "ezJapan")|(artPoster == "Wojnarowski")|(artPoster == "kivan00")|(artPoster == "hsnugear")) %>%
filter(commentNum > 80) %>%
filter(cmtStatus!="→") %>%
group_by(cmtPoster, artUrl) %>%
filter( n() > 5) %>%
ungroup() %>%
select(cmtPoster, artPoster, artUrl, cmtStatus) %>%
unique()
# next, draw the network graph
# keep only users who appear in link
filtered_user <- userList %>%
filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
arrange(desc(type))
# build the network
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
# color nodes by user role
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "red", "lightblue")
# color each edge by the comment status (push vs. boo)
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$cmtStatus == "推", "#cacaca", "#5588a3")
# draw the social network graph
set.seed(5432)
plot(reviewNetwork, vertex.size=2, edge.width=3, vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA),vertex.label.font=2)
# add legends
par(family='STXihei')
legend("bottomright", c("發文者","回文者"), pch=21,
col="#777777", pt.bg=c("red","lightblue"), pt.cex=1, cex=1) ## Warning in strwidth(legend, units = "user", cex = cex, font = text.font): font
## family not found in Windows font database
legend("topleft", c("推","噓"),
col=c("#cacaca", "#5588a3"), lty=1, cex=1)觀察上圖可以發現 hsnugear 在社群網路圖中沒有出現,但其他三個都存在degree3以上的關係,而且推數大於噓數,此時發現一個有趣的現象,在社群網路圖中,可以找出常常在這三位使用者底下噓文的帳號 dodorol。
# draw the network again with a lower degree threshold
set.seed(5432)
plot(reviewNetwork, vertex.size=2, edge.width=3, vertex.label.dist=1,
vertex.label=ifelse(degree(reviewNetwork) > 1, V(reviewNetwork)$label, NA),vertex.label.font=2)
# add legends
par(family='STXihei')
legend("bottomright", c("發文者","回文者"), pch=21,
col="#777777", pt.bg=c("red","lightblue"), pt.cex=1, cex=1) ## Warning in strwidth(legend, units = "user", cex = cex, font = text.font): font
## family not found in Windows font database
legend("topleft", c("推","噓"),
col=c("#cacaca", "#5588a3"), lty=1, cex=1)在降低degree後發現 hsnugear 再次出現在社群網路圖中,代表可能意見領袖發布的內容之推噓程度會有落差。
library(networkD3)
links = Reviews %>%
filter(commentNum > 80) %>%
filter(cmtStatus!="→") %>%
group_by(cmtPoster, artUrl) %>%
filter( n() > 5) %>%
ungroup() %>%
select(cmtPoster, artPoster, artUrl, cmtStatus) %>%
  unique()
nodes = filtered_user
nodes$id = 0:(length(nodes$user) - 1)
# prepare node data in the format networkD3 expects
nodes_complete <- data.frame(nodeID = unique(c(links$cmtPoster, links$artPoster)))
nodes_complete$group <- nodes$type[match(nodes_complete$nodeID, nodes$user,nomatch=36)]
#nodes_complete<-nodes_complete %>% filter(!is.na(group))
links$source <- match(links$cmtPoster, nodes_complete$nodeID) - 1
links$target <- match(links$artPoster, nodes_complete$nodeID) - 1
# draw the interactive network (networkD3 was already loaded above)
forceNetwork(Links = links, Nodes = nodes_complete, Source = "source",
Target = "target", NodeID = "nodeID", Group = "group",
opacity = 0.8, fontSize = 10, zoom = TRUE,legend = TRUE, opacityNoHover = TRUE,
colourScale = "d3.scaleOrdinal(d3.schemeCategory10);",
             linkColour = ifelse(links$cmtStatus == "推", "#cacaca", "#5588a3") # edge color: push vs. boo
)
## Links is a tbl_df. Converting to a plain data frame.
Combining monthly article and comment volumes for the news posts on the PTT Gossiping board from 2021/03/01 to 2021/05/31, we identified six topics: entertainment news, international news, vaccine-related news, workplace accidents, pandemic-related news, and social/criminal cases.
From the social network graphs, we also found that the community contains many more repliers than active users who post and push or boo.
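That imbalance can be read directly off userList; a short check consistent with the counts above (4,080 posters among 70,351 participants):
# posters vs. users who only ever replied
table(userList$type)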