[1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8"
This dataset consists of PTT Gossiping board articles from 2020/09/01 to 2021/05/15, retrieved from a text-analysis platform with the keyword search [藻礁] (algal reef), for a total of 517 articles.
You can see that the volume of posts on the algal-reef issue increased after February–March 2021.
metadata = metadata %>% select(-artPoster,-artCat,-commentNum,-push,-boo)
metadata%>%
mutate(artDate = as.Date(artDate)) %>%
group_by(artDate) %>%
summarise(count = n())%>%
ggplot(aes(artDate,count))+
geom_line(color="red")+
geom_point()
# load stop words
stop_words <- scan(file = "./dict/stop_words.txt", what=character(),sep='\n',
encoding='utf-8',fileEncoding='utf-8')
Read 1211 items
# load custom lexicon
lexicon <- scan(file = "./dict/lexicon.txt", what=character(),sep='\n',
                encoding='utf-8',fileEncoding='utf-8',quiet = T)
# custom dictionary of algal-reef-related terms
lexicon
[1] "凱道" "環保團體" "蔡總統" "上千個" "何宗勳"
[6] "公民不合作運動" "生態協會" "環團" "環境權" "申請案"
[11] "海岸生態小組" "面紗" "北律環境法" "許嘉容" "洪申翰"
[16] "吳達彥" "長期" "張譽尹" "胡智皓" "蔡雅瀅"
[21] "陳憲政" "關注" "團體獎" "保育類" "大潭"
[26] "觀新藻礁" "多杯孔珊瑚" "領銜人" "潘忠政" "陳吉仲"
[31] "洪申翰" "藻礁" "黃暐瀚" "羅秉成" "蔡英文"
[36] "綠共" "核四" "重啟" "反核" "在地"
[41] "陳椒華" "柴山" "殼狀" "柯金源" "公民行動聯盟"
[46] "抗中保台" "行動聯盟" "在地" "林飛帆" "羅秉成"
[51] "林鶴明" "三接開發案" "張惇涵" "陳奕齊" "陳逸樺"
[56] "空污"
Tokenize the text using the custom dictionary.
metadata = metadata %>%
mutate(sentence=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除", "", sentence))
jieba_tokenizer = worker()
# 使用疫情相關字典重新斷詞
new_user_word(jieba_tokenizer, c(lexicon))
[1] TRUE
chi_tokenizer <- function(t) {
lapply(t, function(x) {
if(nchar(x)>1){
tokens <- segment(x, jieba_tokenizer)
tokens <- tokens[!tokens %in% stop_words]
# 去掉字串長度爲1的詞彙
tokens <- tokens[nchar(tokens)>1]
return(tokens)
}
})
}
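A quick sanity check of the tokenizer (hypothetical input; the exact segmentation depends on the dictionaries loaded above):
chi_tokenizer("珍愛藻礁公投連署")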
Count how many times each token appears in each article.
tokens <- metadata %>%
unnest_tokens(word, sentence, token=chi_tokenizer) %>%
filter((!str_detect(word, regex("[0-9a-zA-Z]"))) | str_detect(word, regex("[Aa][Zz]"))) %>%
count(artUrl, word) %>%
rename(count=n)#把數字和英文拿掉
tokens %>% head(20)
<<DocumentTermMatrix (documents: 517, terms: 10867)>>
Non-/sparse entries: 36035/5582204
Sparsity : 99%
Maximal term length: 7
Weighting : term frequency (tf)
<<DocumentTermMatrix (documents: 10, terms: 10)>>
Non-/sparse entries: 18/82
Sparsity : 82%
Maximal term length: 3
Weighting : term frequency (tf)
Sample :
Terms
Docs 白紙 保護 杯子 標語 波波
https://www.ptt.cc/bbs/Gossiping/M.1599316636.A.647.html 1 1 3 1 1
https://www.ptt.cc/bbs/Gossiping/M.1599736664.A.F32.html 0 2 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1603116289.A.E0A.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1610188174.A.1A3.html 0 2 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1611161003.A.2CC.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1613811584.A.0FC.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1613821692.A.A97.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1613888324.A.AB9.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1613894790.A.454.html 0 2 0 1 0
https://www.ptt.cc/bbs/Gossiping/M.1613944743.A.DF7.html 0 2 0 0 0
Terms
Docs 蔡總統 參與 參與者 承諾 出面
https://www.ptt.cc/bbs/Gossiping/M.1599316636.A.647.html 1 2 1 3 1
https://www.ptt.cc/bbs/Gossiping/M.1599736664.A.F32.html 0 4 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1603116289.A.E0A.html 0 1 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1610188174.A.1A3.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1611161003.A.2CC.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1613811584.A.0FC.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1613821692.A.A97.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1613888324.A.AB9.html 0 0 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1613894790.A.454.html 0 1 0 0 0
https://www.ptt.cc/bbs/Gossiping/M.1613944743.A.DF7.html 0 0 0 0 0
lda <- LDA(dtm, k = 2, control = list(seed = 2021))#兩個topics
# lda <- LDA(dtm, k = 2, control = list(seed = 2021,alpha = 2,delta=0.1),method = "Gibbs") #調整alpha即delta
lda
A LDA_VEM topic model with 2 topics.
topics_words <- tidy(lda, matrix = "beta") # 注意,在tidy function裡面要使用"beta"來取出Phi矩陣。
colnames(topics_words) <- c("topic", "term", "phi")
topics_words#phi值越大,代表那個字越傾向那個topic
Sort the terms within each topic by phi in descending order and list the top 10.
topics_words %>%
group_by(topic) %>%
top_n(10, phi) %>%
ungroup() %>%
mutate(top_words = reorder_within(term,phi,topic)) %>%
ggplot(aes(x = top_words, y = phi, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +theme(text = element_text(family='STHeitiTC-Light'))+
scale_x_reordered()
Try 2, 4, 6, 10, and 15 topics, save the results, and analyze them further. This step takes a while to run; the fitted models have already been saved to ldas_result2.rdata and can be loaded directly.
# ldas = c()
# topics = c(2,4,6,10,15)
# for(topic in topics){
#   start_time <- Sys.time()
#   lda <- LDA(dtm, k = topic, control = list(seed = 2021))
#   ldas = c(ldas, lda)
#   print(paste(topic, paste("topic(s) and use time is ", Sys.time() - start_time)))
#   save(ldas, file = "ldas_result2.rdata")  # save the fitted models to a file
# }
Load the saved LDA results for each topic number.
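A minimal sketch of the load step, assuming the file saved in the loop above:
load("ldas_result2.rdata")  # restores the list of fitted models `ldas`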
topics = c(2,4,6,10,15)
tibble(k = topics, perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
ggplot(aes(k, perplex)) +
geom_point() +
geom_line() +
labs(title = "Evaluating LDA topic models",
subtitle = "Optimal number of topics (smaller is better)",
x = "Number of topics",
y = "Perplexity")
Create the JSON required by LDAvis. This function converts a model built with the LDA function above into the input format of the LDAvis package.
topicmodels_json_ldavis <- function(fitted, doc_term){
require(LDAvis)
require(slam)
###以下function 用來解決,主題數多會出現NA的問題
### 參考 https://github.com/cpsievert/LDAvis/commit/c7234d71168b1e946a361bc00593bc5c4bf8e57e
ls_LDA = function (phi){
jensenShannon <- function(x, y) {
m <- 0.5 * (x + y)
lhs <- ifelse(x == 0, 0, x * (log(x) - log(m+1e-16)))
rhs <- ifelse(y == 0, 0, y * (log(y) - log(m+1e-16)))
0.5 * sum(lhs) + 0.5 * sum(rhs)
}
dist.mat <- proxy::dist(x = phi, method = jensenShannon)
pca.fit <- stats::cmdscale(dist.mat, k = 2)
data.frame(x = pca.fit[, 1], y = pca.fit[, 2])
}
# Find required quantities
phi <- as.matrix(posterior(fitted)$terms)
theta <- as.matrix(posterior(fitted)$topics)
vocab <- colnames(phi)
term_freq <- slam::col_sums(doc_term)
# Convert to json
json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
vocab = vocab,
doc.length = as.vector(table(doc_term$i)),
term.frequency = term_freq, mds.method = ls_LDA)
return(json_lda)
}
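A usage sketch (assuming the model the_lda and the matrix dtm from above; LDAvis::serVis() opens the interactive view):
json <- topicmodels_json_ldavis(the_lda, dtm)
LDAvis::serVis(json)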
topics_words <- tidy(the_lda, matrix = "beta") # 注意,在tidy function裡面要使用"beta"來取出Phi矩陣。
colnames(topics_words) <- c("topic", "term", "phi")
topics_words %>% arrange(desc(phi)) %>% head(10)
Sort the terms within each topic by phi in descending order.
topics_words %>%
group_by(topic) %>%
top_n(12, phi) %>%
ungroup() %>%
ggplot(aes(x = reorder_within(term,phi,topic), y = phi, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +theme(text = element_text(family='STHeitiTC-Light'))+
scale_x_reordered()
Remove words shared across topics, as well as words with no discriminative meaning.
# common / meaningless words to drop
removed_word = c("不是","沒有","什麼","藻礁","大潭","可以","表示","就是","台灣","真的","有沒有")
topics_words %>%
  filter(!term %in% removed_word) %>%
  group_by(topic) %>%
  top_n(12, phi) %>%
  ungroup() %>%
  ggplot(aes(x = reorder_within(term, phi, topic), y = phi, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  theme(text = element_text(family = 'STHeitiTC-Light')) +
  scale_x_reordered()
# for every document we have a probability distribution over its topics
tmResult <- posterior(the_lda)
doc_pro <- tmResult$topics
# align topic proportions with the metadata rows
document_topics <- doc_pro[metadata$artUrl, ]
document_topics_df = data.frame(document_topics)
# topics_name: a character vector of topic labels defined earlier (not shown)
colnames(document_topics_df) = topics_name
rownames(document_topics_df) = NULL
news_topic = cbind(metadata, document_topics_df)
Now we can look at the topic distribution of each article!
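The plots below rely on a palette vector mycolors whose definition is not shown; a minimal stand-in, assuming RColorBrewer:
library(RColorBrewer)
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(12)  # hypothetical palette; the original definition is not shown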
news_topic %>%
mutate(artDate = as.Date(artDate)) %>%
group_by(artDate = format(artDate,'%Y%m')) %>%
summarise_if(is.numeric, sum, na.rm = TRUE) %>%
melt(id.vars = "artDate")%>%
ggplot( aes(x=artDate, y=value, fill=variable)) +
geom_bar(stat = "identity") + ylab("value") +
scale_fill_manual(values=mycolors[c(1,5,8,12)])+theme(text = element_text(family='STHeitiTC-Light'))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
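The same monthly view, this time excluding 2020-09, 2020-10, and 2021-01 (presumably months with too few articles to be informative):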
news_topic %>%
mutate(artDate = as.Date(artDate)) %>%
filter( !format(artDate,'%Y%m') %in% c(202009,202010,202101))%>%
group_by(artDate = format(artDate,'%Y%m')) %>%
summarise_if(is.numeric, sum, na.rm = TRUE) %>%
melt(id.vars = "artDate")%>%
ggplot( aes(x=artDate, y=value, fill=variable)) +
geom_bar(stat = "identity") + ylab("value")+scale_fill_manual(values=mycolors[c(1,5,8,12)])+theme(text = element_text(family='STHeitiTC-Light'))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
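Finally, the same breakdown normalized to the proportion each topic takes up within a month: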
news_topic %>%
mutate(artDate = as.Date(artDate)) %>%
filter( !format(artDate,'%Y%m') %in% c(202009,202010,202101))%>%
group_by(artDate = format(artDate,'%Y%m')) %>%
summarise_if(is.numeric, sum, na.rm = TRUE) %>%
melt(id.vars = "artDate")%>%
group_by(artDate)%>%
mutate(total_value =sum(value))%>%
ggplot( aes(x=artDate, y=value/total_value, fill=variable)) +
geom_bar(stat = "identity") + ylab("proportion") +
scale_fill_manual(values=mycolors[c(1,5,8,12)])+theme(text = element_text(family='STHeitiTC-Light'))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Reference: http://text2vec.org/topic_modeling.html#latent_dirichlet_allocation
library(text2vec)
library(udpipe)
# re-tokenize with the same filter as before
tokens <- metadata %>%
  unnest_tokens(word, sentence, token = chi_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]")) | str_detect(word, regex("[Aa][Zz]")))
dtf <- document_term_frequencies(tokens, document = "artUrl", term = "word")
dtm <- document_term_matrix(x = dtf)
# drop terms that occur fewer than 30 times
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 30)
dim(dtm_clean)
[1] 517 274
set.seed(2019)
topic_n = 4
lda_model = text2vec::LDA$new(n_topics = topic_n, doc_topic_prior = 0.1, topic_word_prior = 0.001)
doc_topic_distr = lda_model$fit_transform(dtm_clean, n_iter = 1000, convergence_tol = 1e-5, check_convergence_every_n = 100)
INFO [17:59:01.410] early stopping at 110 iteration
INFO [17:59:01.545] early stopping at 50 iteration
This runs many times faster than the topicmodels package.
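The top-10 words per topic below are presumably produced with the model's get_top_words() method; a sketch (the lambda value is an assumption):
lda_model$get_top_words(n = 10, topic_number = 1:4, lambda = 1)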
[,1] [,2] [,3] [,4]
[1,] "藻礁" "藻礁" "天然氣" "連署"
[2,] "大潭" "現在" "能源" "公投"
[3,] "桃園" "台灣" "方案" "藻礁"
[4,] "海岸" "問題" "電廠" "珍愛"
[5,] "生物" "一個" "接收站" "民進黨"
[6,] "生態" "真的" "轉型" "環團"
[7,] "地方" "支持" "政府" "溝通"
[8,] "觀新藻礁" "重要" "三接" "議題"
[9,] "研究" "鳳梨" "減煤" "同路人"
[10,] "中油" "是不是" "增加" "蔡英文"
Loading required namespace: servr
Failed with error: ‘there is no package called ‘servr’’
If the visualization doesn't render, install the servr package
and re-run serVis:
install.packages('servr')
Alternatively, you could configure your default browser to allow
access to local files as some browsers block this by default