System setup

Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # For ubuntu
## [1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8"
# Sys.setlocale("LC_CTYPE", "cht") # For windows.

Install the required packages

packages = c("readr", "tm", "data.table", "reshape2", "dplyr", "stringr", "jiebaR", "udpipe",
             "tidytext", "ggplot2", "tidyr", "topicmodels", "LDAvis", "wordcloud2", "webshot",
             "purrr", "ramify", "RColorBrewer", "htmlwidgets", "servr")
# reshape2 is used by melt() on data frames later; udpipe and wordcloud2 are loaded below
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
require(readr)
require(tm)
require(data.table)
require(reshape2)
require(dplyr)
require(stringr)
require(jiebaR)
require(udpipe)
require(tidytext)
require(ggplot2)
require(tidyr)
require(topicmodels)
require(LDAvis)
require(wordcloud2)
require(webshot)
require(htmlwidgets)
require(servr)
require(purrr)
require(ramify)
require(RColorBrewer)
mycolors <- colorRampPalette(brewer.pal(8, "Set3"))(20)

Read the data

Load the news article data

news= read.csv("new_mask_articleMetaData.csv",stringsAsFactors = FALSE) 
news$artDate = as.Date(news$artDate)

Data description

  • Using the 中山管院 (NSYSU College of Management) text analysis platform, we searched 聯合新聞網, 東森新聞網, and 中國時報新聞網 for the keyword [口罩] (face mask) from 2020-01-01 to 2020-05-10, spanning five calendar months, for a total of 1,795 articles (the quick check below confirms these numbers).
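As a small sketch (not part of the original script), we can verify the article count and date range stated above before plotting:

nrow(news)            # expected to be 1795
range(news$artDate)   # expected to span roughly 2020-01-01 to 2020-05-10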
news %>% 
  group_by(artDate) %>%
  summarise(count = n()) %>%
  ggplot(aes(artDate, count)) +
  geom_line(color = "red") +
  geom_point()

We can see that most of the articles appear after early February.
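To quantify this, a short sketch counting articles per month (the month column is just a label created here):

news %>%
  count(month = format(artDate, "%Y%m"))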

Tokenization

# Initialize a segmentation engine with the default parameters
jieba_tokenizer = worker()
news_tokenizer <- function(t) {
  lapply(t, function(x) {
    if(nchar(x)>1){
      tokens <- segment(x, jieba_tokenizer)
      # drop tokens whose length is 1
      tokens <- tokens[nchar(tokens)>1]
      return(tokens)
    }
  })
}
tokens <- news %>%
  unnest_tokens(word, sentence, token=news_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>%
  count(artUrl, word) %>%
  rename(count=n)
tokens %>% head(20)
## # A tibble: 20 x 3
##    artUrl                                      word     count
##    <chr>                                       <chr>    <int>
##  1 https://news.ebc.net.tw/news/article/194992 幫助         1
##  2 https://news.ebc.net.tw/news/article/194992 爆發         1
##  3 https://news.ebc.net.tw/news/article/194992 北京         1
##  4 https://news.ebc.net.tw/news/article/194992 鼻樑         4
##  5 https://news.ebc.net.tw/news/article/194992 避免         1
##  6 https://news.ebc.net.tw/news/article/194992 並將鼻       1
##  7 https://news.ebc.net.tw/news/article/194992 病毒         1
##  8 https://news.ebc.net.tw/news/article/194992 病例         2
##  9 https://news.ebc.net.tw/news/article/194992 病原體       1
## 10 https://news.ebc.net.tw/news/article/194992 不斷擴大     1
## 11 https://news.ebc.net.tw/news/article/194992 不過         1
## 12 https://news.ebc.net.tw/news/article/194992 不明         1
## 13 https://news.ebc.net.tw/news/article/194992 不少         1
## 14 https://news.ebc.net.tw/news/article/194992 不要         1
## 15 https://news.ebc.net.tw/news/article/194992 步驟         2
## 16 https://news.ebc.net.tw/news/article/194992 朝外         1
## 17 https://news.ebc.net.tw/news/article/194992 出入         1
## 18 https://news.ebc.net.tw/news/article/194992 出現         1
## 19 https://news.ebc.net.tw/news/article/194992 處方         1
## 20 https://news.ebc.net.tw/news/article/194992 觸摸         1

Convert the data into a Document Term Matrix (DTM)

news_dtm <- tokens %>% cast_dtm(artUrl, word, count)
news_dtm
## <<DocumentTermMatrix (documents: 1795, terms: 31102)>>
## Non-/sparse entries: 227677/55600413
## Sparsity           : 100%
## Maximal term length: 9
## Weighting          : term frequency (tf)
inspect(news_dtm[1:10,1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 29/71
## Sparsity           : 71%
## Maximal term length: 4
## Weighting          : term frequency (tf)
## Sample             :
##                                              Terms
## Docs                                          幫助 爆發 北京 鼻樑 避免 並將鼻
##   https://news.ebc.net.tw/news/article/194992    1    1    1    4    1      1
##   https://news.ebc.net.tw/news/article/195025    0    0    1    0    0      0
##   https://news.ebc.net.tw/news/article/195037    0    0    0    0    0      0
##   https://news.ebc.net.tw/news/article/195042    0    0    0    0    1      0
##   https://news.ebc.net.tw/news/article/195044    0    0    0    0    1      0
##   https://news.ebc.net.tw/news/article/195083    0    0    0    4    0      1
##   https://news.ebc.net.tw/news/article/195119    0    0    0    0    1      0
##   https://news.ebc.net.tw/news/article/195124    0    0    0    0    2      0
##   https://news.ebc.net.tw/news/article/195189    0    1    0    0    0      0
##   https://news.ebc.net.tw/news/article/195193    0    0    0    0    4      0
##                                              Terms
## Docs                                          病毒 病例 病原體 不斷擴大
##   https://news.ebc.net.tw/news/article/194992    1    2      1        1
##   https://news.ebc.net.tw/news/article/195025    0    2      0        0
##   https://news.ebc.net.tw/news/article/195037    0    0      0        0
##   https://news.ebc.net.tw/news/article/195042    0    2      0        0
##   https://news.ebc.net.tw/news/article/195044    0    2      0        0
##   https://news.ebc.net.tw/news/article/195083    0    1      0        1
##   https://news.ebc.net.tw/news/article/195119    4    0      0        0
##   https://news.ebc.net.tw/news/article/195124    0    1      0        0
##   https://news.ebc.net.tw/news/article/195189    0    1      0        0
##   https://news.ebc.net.tw/news/article/195193    3    1      0        0

Inspecting the DTM, we can see that it is a sparse matrix.
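As a small sketch, the sparsity can also be computed directly from the triplet representation (a DocumentTermMatrix only stores its non-zero entries):

non_zero <- length(news_dtm$v)                     # number of non-zero cells
1 - non_zero / (nrow(news_dtm) * ncol(news_dtm))   # share of zero cells, very close to 1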

Build the LDA model

lda <- LDA(news_dtm, k = 2, control = list(seed = 2020))

\(\phi\) Matrix

Inspect the \(\phi\) matrix (topic × term)

topics <- tidy(lda, matrix = "beta") # Note: use matrix = "beta" in tidy() to extract the phi matrix.
topics
## # A tibble: 62,204 x 3
##    topic term      beta
##    <int> <chr>    <dbl>
##  1     1 幫助  1.70e- 4
##  2     2 幫助  2.51e- 4
##  3     1 爆發  5.51e- 4
##  4     2 爆發  4.93e- 4
##  5     1 北京  9.38e-16
##  6     2 北京  3.36e- 5
##  7     1 鼻樑  2.05e- 5
##  8     2 鼻樑  1.72e- 4
##  9     1 避免  9.85e- 4
## 10     2 避免  1.36e- 3
## # … with 62,194 more rows

From topics we can read off the probability that a given topic generates a given term.
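As a quick sanity check (a small sketch), beta is a probability distribution over the vocabulary, so it should sum to 1 within each topic:

topics %>%
  group_by(topic) %>%
  summarise(total_beta = sum(beta))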

Find the representative terms of each topic

top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  theme(text = element_text(family = "Heiti TC Light")) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

From the plot above, the two topics look almost identical and there is no visible difference between them, so let's try a larger number of topics.

More topics

  • Try 2, 3, 10, 25, and 36 topics, save the fitted models, and analyze them further.
# ldas = c()
# topics = c(2,3,10,25,36)
# for(topic in topics){
#   start_time <- Sys.time()
#   lda <- LDA(news_dtm, k = topic, control = list(seed = 2020))
#   ldas =c(ldas,lda)
#   print(paste(topic ,paste("topic(s) and use time is ", Sys.time() -start_time)))
#   save(ldas,file = "ldas_result")
# }

Fitting these models takes several hours, so the results have already been saved to the file ldas_result.

Load the saved LDA results

load("ldas_result")

The fitted models are stored in the variable ldas; you can retrieve the model with the desired number of topics via ldas[[x]].
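A small sketch (assuming ldas was loaded as above) that lists the number of topics of each stored model, so you know which index corresponds to which k:

sapply(ldas, function(m) m@k)  # should print 2 3 10 25 36, matching the loop above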

Use perplexity to find the best number of topics

topics = c(2,3,10,25,36)
tibble(k = topics,
       perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
  ggplot(aes(k, perplex)) +
  geom_point() +
  geom_line() +
  labs(title = "Evaluating LDA topic models",
       subtitle = "Optimal number of topics (smaller is better)",
       x = "Number of topics",
       y = "Perplexity")

Lower perplexity is better, but if it is pushed too low the corpus gets split into overly fine-grained topics. In practice, pick a reasonable number of topics whose perplexity is relatively low.
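A minimal sketch (using ldas and the topics vector defined above) that reports the k with the lowest perplexity; keep in mind that the smallest value is not automatically the most interpretable choice:

perplex <- map_dbl(ldas, topicmodels::perplexity)
topics[which.min(perplex)]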

Supplement: ldatuning

# if(!('ldatuning' %in% existing)){install.packages("ldatuning")}
# library("ldatuning")
# result <- FindTopicsNumber(
#   news_dtm,
#   topics = topics,
#   metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
#   method = "Gibbs",
#   control = list(seed = 2020),
#   mc.cores = 2L,
#   verbose = TRUE
# )
# FindTopicsNumber_plot(result)

This also takes several hours to run; see the ldatuning documentation for details on how to use it.

LDAvis

Function that generates the JSON required by LDAvis

  • This function converts the model built earlier with the LDA function into the input format expected by the LDAvis package.
topicmodels_json_ldavis <- function(fitted, doc_term){
  require(LDAvis)
  require(slam)

  ## The helper below works around the NA problem that appears when the number of topics is large
  ## See https://github.com/cpsievert/LDAvis/commit/c7234d71168b1e946a361bc00593bc5c4bf8e57e
  ls_LDA = function(phi){
    jensenShannon <- function(x, y) {
      m <- 0.5 * (x + y)
      lhs <- ifelse(x == 0, 0, x * (log(x) - log(m + 1e-16)))
      rhs <- ifelse(y == 0, 0, y * (log(y) - log(m + 1e-16)))
      0.5 * sum(lhs) + 0.5 * sum(rhs)
    }
    dist.mat <- proxy::dist(x = phi, method = jensenShannon)
    pca.fit <- stats::cmdscale(dist.mat, k = 2)
    data.frame(x = pca.fit[, 1], y = pca.fit[, 2])
  }

  # Find required quantities
  phi <- as.matrix(posterior(fitted)$terms)
  theta <- as.matrix(posterior(fitted)$topics)
  vocab <- colnames(phi)
  term_freq <- slam::col_sums(doc_term)
  doc_length <- slam::row_sums(doc_term)  # total token count per document

  # Convert to json
  json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                                 vocab = vocab,
                                 doc.length = as.vector(doc_length),
                                 term.frequency = term_freq, mds.method = ls_LDA)

  return(json_lda)
}

Generate the LDAvis output

# Set the alpha and delta parameters
#devotion_lda_removed <- LDA(devotion_dtm_removed, k = 4, method = "Gibbs", control = list(seed = 1234, alpha = 2, delta= 0.1))

####### The code below generates the LDAvis files, which can later be opened locally or hosted online ##########
# for(lda in ldas){
#
#   k = lda@k  ## number of topics in this model
#   if(k==2){next}
#   json_res <- topicmodels_json_ldavis(lda, news_dtm)
#   # serVis(json_res, open.browser = T)
#   lda_dir = paste0(k, "_ldavis")
#   if(!dir.exists(lda_dir)){ dir.create(lda_dir) }
#
#   serVis(json_res, out.dir = lda_dir, open.browser = F)
#
#   # re-encode lda.json as UTF-8 so the Chinese terms display correctly
#   json_path = paste0(lda_dir, "/lda.json")
#   writeLines(iconv(readLines(json_path), to = "UTF-8"), json_path)
# }

topic_10 = ldas[[3]]
json_res <- topicmodels_json_ldavis(topic_10,news_dtm)

serVis(json_res,open.browser = T)

# If no browser window opens (Windows users), run this instead
# serVis(json_res, out.dir = "vis", open.browser = T)
# writeLines(iconv(readLines("./vis/lda.json"), to = "UTF-8"), "./vis/lda.json")

Follow-up LDA analysis

news_lda = ldas[[3]] ## pick the model with 10 topics

topics <- tidy(news_lda, matrix = "beta") # Note: use matrix = "beta" in tidy() to extract the phi matrix.
topics
## # A tibble: 311,020 x 3
##    topic term        beta
##    <int> <chr>      <dbl>
##  1     1 幫助  0.000164  
##  2     2 幫助  0.000209  
##  3     3 幫助  0.000202  
##  4     4 幫助  0.0000993 
##  5     5 幫助  0.000369  
##  6     6 幫助  0.00000368
##  7     7 幫助  0.000167  
##  8     8 幫助  0.000195  
##  9     9 幫助  0.000435  
## 10    10 幫助  0.000292  
## # … with 311,010 more rows

Find the representative terms of each topic

  • Collect the 10 terms with the highest generation probability in each topic.
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values=mycolors)+
  theme(text = element_text(family = "Heiti TC Light")) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Every topic is dominated by the search keywords we started with, so the differences between the topics are hard to see.

  • Remove frequent terms that are shared across topics.
remove_word = c("口罩","防疫","疫情","肺炎","民眾","新冠")
top_terms <- topics %>%
  filter(!term  %in% remove_word)%>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values=mycolors)+
  theme(text = element_text(family = "Heiti TC Light")) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Now we can see what each topic is mainly about!

Naming the topics

topic_name = c("口罩購買",'口罩需求','口罩產量','大眾運輸防疫政策','國際關係','藥局購買','學校防疫政策','台灣確診狀況','None','None2')
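A quick check (a small sketch) that the number of names matches the number of topics in the chosen model:

stopifnot(length(topic_name) == news_lda@k)  # both should be 10 here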

Document topic distributions

# for every document we have a probability distribution of its contained topics
tmResult <- posterior(news_lda)
doc_pro <- tmResult$topics 
dim(doc_pro)               # nDocs(DTM) distributions over K topics
## [1] 1795   10

Each article has its own distribution over topics, hence 1795 articles × 10 topics.
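As a sanity check (a small sketch), every row of doc_pro is a probability distribution over the 10 topics and should therefore sum to 1:

summary(rowSums(doc_pro))  # all values should be (numerically) 1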

cbind the document topic distributions onto the news data

# get document topic proportions 
document_topics <- doc_pro[news$artUrl,]
document_topics_df =data.frame(document_topics)
colnames(document_topics_df) = topic_name
rownames(document_topics_df) = NULL
news_topic = cbind(news,document_topics_df)
# news_topic %>% head(10)

Now we can see the topic distribution of every article!
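A small sketch (not in the original script) that labels each article with its single most probable topic and counts how many articles fall into each:

dominant <- topic_name[apply(document_topics, 1, which.max)]
table(dominant)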

Inspect the articles of a specific topic

  • After sorting the articles by their proportion of a given topic, we can see what the articles with the highest weight on that topic are discussing.
news_topic %>%
  arrange(desc(`大眾運輸防疫政策`)) %>%
  head(10)

We can see that the articles with the highest weight on the 大眾運輸防疫政策 topic are mostly about the government's epidemic-prevention policies for public transport (rail and buses).
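To keep the preview compact, a small sketch that shows only a few identifying columns (artDate and artUrl exist in this dataset) together with the topic proportion:

news_topic %>%
  arrange(desc(`大眾運輸防疫政策`)) %>%
  head(10) %>%
  dplyr::select(artDate, artUrl, `大眾運輸防疫政策`)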

Topic trends over time

news_topic[, topic_name] <- sapply(news_topic[, topic_name], as.numeric)  # make sure the topic proportion columns are numeric
news_topic %>% 
  group_by(artDate = format(artDate, '%Y%m')) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  melt(id.vars = "artDate") %>%
  ggplot(aes(x = artDate, y = value, fill = variable)) + 
  geom_bar(stat = "identity") + ylab("value") + 
  theme(text = element_text(family = "Heiti TC Light")) +
  scale_fill_manual(values = mycolors) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Remove the None and None2 topics

Remove months with few articles and the None topics

news_topic %>%
  # filter( !format(artDate,'%Y%m') %>%
  dplyr::select(-None, -None2) %>%
  group_by(artDate = format(artDate, '%Y%m')) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  melt(id.vars = "artDate") %>%
  ggplot(aes(x = artDate, y = value, fill = variable)) + 
  geom_bar(stat = "identity") + ylab("value") + 
  theme(text = element_text(family = "Heiti TC Light")) +
  scale_fill_manual(values = mycolors) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

We can see the volume of each month, but the proportions within each month are not clear.

Topic trends over time as proportions

news_topic %>%
  # filter( !format(artDate,'%Y%m') %>%
  dplyr::select(-None, -None2) %>%
  group_by(artDate = format(artDate, '%Y%m')) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  melt(id.vars = "artDate") %>%
  group_by(artDate) %>%
  mutate(total_value = sum(value)) %>%
  ggplot(aes(x = artDate, y = value/total_value, fill = variable)) + 
  theme(text = element_text(family = "Heiti TC Light")) +
  geom_bar(stat = "identity") + ylab("proportion") + 
  scale_fill_manual(values = mycolors) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now we can see each topic's share in every month!