動機與分析目的

因為最近網紅是個很熱門的職業,不管老幼婦孺,只要敢展現自我、發揮創意,人人都有機會成為網紅。我們想探討以“館長”這個人格特質很突出,又在ptt討論熱度很高的網紅,是留給人們什麼的印象,他又在討論什麼才會引起大家那麼多的關注。

前置作業

資料取得及套件載入

載入的資料是由中山大學管理學院文字分析平台取得,在平台資料選擇下載原始資料所取得之csv檔案。

資料簡介

本資料為2019/01/01 ~ 2019/04/12 PTT八卦版之資料,透過文字分析平台檢索「館長」、「陳之漢」兩個關鍵字,共搜尋到646篇文章。

Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # 避免中文亂碼
[1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8"

安裝需要的packages

packages = c("dplyr", "tidytext", "jiebaR", "gutenbergr", "stringr", "wordcloud2", "ggplot2", "tidyr", "scales", "widyr", "readr", "reshape2", "NLP", "ggraph", "igraph", "tm", "data.table", "quanteda", "Matrix", "slam", "Rtsne", "randomcoloR", "wordcloud", "topicmodels", "LDAvis", "webshot", "htmlwidgets","servr")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
require(dplyr)
require(tidytext)
require(jiebaR)
require(gutenbergr)
require(stringr)
require(wordcloud2)
require(ggplot2)
require(tidyr)
require(scales)
require(widyr)
require(readr)
require(reshape2)
require(NLP)
require(ggraph)
require(igraph)
require(tm)
require(data.table)
require(quanteda)
require(Matrix)
require(slam)
require(Rtsne)
require(randomcoloR)
require(wordcloud)
require(topicmodels)
require(LDAvis)
require(webshot)
require(htmlwidgets)
require(servr)

資料載入

g_csv <- fread("guan_jang_data.csv", encoding = "UTF-8", header = TRUE)
g_csv <- g_csv %>% 
  filter(artUrl != "https://www.ptt.cc/bbs/Gossiping/M.1547888391.A.836.html")
g_csv$artDate = g_csv$artDate %>% as.Date("%Y/%m/%d")

預覽資料

head(g_csv)

日期折線圖

計算出每一天文章的發表數量,看出討論「館長」的熱度。

資料處理

g_date <- g_csv %>% 
  select(artDate, artUrl) %>% 
  distinct()
article_count_by_date <- g_date %>% 
  group_by(artDate) %>% 
  summarise(count = n())
article_count_by_date %>% 
  arrange(desc(count))%>% 
  top_n(10)
Selecting by count

討論篇數最多的前10天。

date_plot <- article_count_by_date %>% 
  ggplot(aes(x = artDate, y = count)) +
  geom_line(color = "purple", size = 1.5) + 
  geom_vline(xintercept = c(as.numeric(as.Date("2019-01-05")),
                            as.numeric(as.Date("2019-01-16")),
                            as.numeric(as.Date("2019-03-04")),
                            as.numeric(as.Date("2019-03-10")), 
                            as.numeric(as.Date("2019-04-09"))), col='red', size = 1) + 
  scale_x_date(labels = date_format("%Y/%m/%d")) +
  ggtitle("「館長」討論文章數") + 
  xlab("日期") + 
  ylab("數量") +
  theme(text = element_text(family = "Heiti TC Light"))
date_plot

從上圖中可以看到關於「館長」的討論在1/05, 1/16, 3/04, 3/10, 4/09 出現高點。
1/05: [問卦]館長嘴斷食,究竟哪邊對? [爆卦]館長:肏你媽有種現在就打過來啦
1/16: [新聞]館長促阿北2020選總統 柯P:我要再想想 [爆卦]館長正在跟柯文哲直播談虐童案
3/04: [新聞]「叫小賀!」孫安佐單挑館長影片曝光眼 [問卦]館長打得贏甄子丹嗎
3/07: 館長:統促黨3/10號要來林口館罵三字經!!!
3/10: [問卦]館長譙統促黨髒話本來就理虧,不是嗎? [新聞]統促黨嗆館長打一場 「簽生死狀,條件你
4/09: [新聞]與館長談統獨賴清德:統一就像斯斯有兩 [新聞]館長對決美國智庫?蔡賴今晚熱身賽

文字雲

接下來我們來大略觀察討論的內容為何,使用的方式為文字雲。

斷詞、停用詞使用

jieba_tokenizer <- worker(user="k_dict.txt", stop_word = "stop_words.txt")
g_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    tokens <- tokens[nchar(tokens)>1]
    return(tokens)
  })
}
g_tokens <- g_csv %>% 
  unnest_tokens(word, sentence, token=g_tokenizer) %>% 
  select(-artTime, -artUrl)
g_tokens_count <- g_tokens %>% 
  group_by(word) %>% 
  summarise(sum = n()) %>% 
  arrange(desc(sum))
head(g_tokens_count)

結果為詞出現最多的字。

移除「館長」

wordc_plot <- g_tokens_count %>% 
  filter(word != "館長") %>% 
  filter(sum > 20) %>% 
  wordcloud2()
wordc_plot

長條圖

長條圖可以查看精確的「最常出現詞彙」。

查看篇數最多的五天中,最常出現的詞彙。

g_tokens_by_date <- g_tokens %>% 
  count(artDate, word, sort = TRUE)
  
plot_merge <- g_tokens_by_date %>% 
  filter(word != "館長" & word != "直播") %>% 
  filter(artDate == as.Date("2019-01-05") | 
         artDate == as.Date("2019-01-16") | 
         artDate == as.Date("2019-03-04") |
         artDate == as.Date("2019-03-10") | 
         artDate == as.Date("2019-04-09")) %>% 
  group_by(artDate) %>% 
  top_n(7, n) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x=word, y=n, fill = artDate)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = NULL) +
  facet_wrap(~artDate, scales="free", ncol = 2) + 
  coord_flip()+
  theme(text = element_text(family = "Heiti TC Light"))
plot_merge

前面通過篇數找出的文章篇數高點的標題 1/05, 1/16, 3/04, 3/10, 4/09 。
1/05: [問卦]館長嘴斷食,究竟哪邊對? [爆卦]館長:肏你媽有種現在就打過來啦
1/16: [新聞]館長促阿北2020選總統 柯P:我要再想想 [爆卦]館長正在跟柯文哲直播談虐童案
3/04: [新聞]「叫小賀!」孫安佐單挑館長影片曝光眼 [問卦]館長打得贏甄子丹嗎
3/10: [問卦]館長譙統促黨髒話本來就理虧,不是嗎? [新聞]統促黨嗆館長打一場 「簽生死狀,條件你
4/09: [新聞]與館長談統獨賴清德:統一就像斯斯有兩 [新聞]館長對決美國智庫?蔡賴今晚熱身賽

計算tf-idf

以文章區格document

g_tokens_by_art <- g_tokens %>% 
  filter(!str_detect(word, regex("[0-9]"))) %>%
  count(artTitle, word, sort = TRUE)
g_total_words_by_art <- g_tokens_by_art %>% 
  group_by(artTitle) %>% 
  summarize(total = sum(n)) %>% 
  arrange(desc(total))
g_tokens_by_art <- left_join(g_tokens_by_art, g_total_words_by_art)
Joining, by = "artTitle"

過濾掉文章長度少於20個詞的

g_words_tf_idf <- g_tokens_by_art %>%
  bind_tf_idf(word, artTitle, n) 
g_words_tf_idf %>% 
  filter(total > 20) %>% 
  arrange(desc(tf_idf))

文章本文:
背心尊者有強大的背心能力
平常館長也很常穿背心
不過我觀察館長穿的是寬鬆類背心
背心尊者穿的是緊身類背心
雖然是不同背心
但同樣都是背心愛好者
手下的教練 也是有背心軍團的資格
館長根本是真人版的背心尊者吧
大家覺得呢

文章總長度大於100個詞

g_words_tf_idf %>% 
  filter(total > 100) %>% 
  arrange(desc(tf_idf))

用日期來區隔document

g_tokens_by_date <- g_tokens %>% 
  filter(!str_detect(word, regex("[0-9]"))) %>%
  count(artDate, word, sort = TRUE)
g_total_words_by_date <- g_tokens_by_date %>% 
  group_by(artDate) %>% 
  summarize(total = sum(n)) %>% 
  arrange(desc(total))
g_tokens_by_date <- left_join(g_tokens_by_date, g_total_words_by_date)
Joining, by = "artDate"
g_words_tf_idf_date <- g_tokens_by_date %>%
  bind_tf_idf(word, artDate, n) 
g_words_tf_idf_date %>% 
  filter(total > 20) %>% 
  group_by(artDate) %>% 
  top_n(1) %>% 
  arrange(artDate)
Selecting by tf_idf

找出前面五個日期篇數高點

g_words_tf_idf_date %>% 
  filter(total > 20) %>% 
  filter(artDate == as.Date("2019-01-05") | 
         artDate == as.Date("2019-01-16") | 
         artDate == as.Date("2019-03-04") |
         artDate == as.Date("2019-03-10") | 
         artDate == as.Date("2019-04-09")) %>% 
  group_by(artDate) %>%  
  top_n(1) %>% 
  arrange(artDate)
Selecting by tf_idf

前面通過篇數找出的文章篇數高點的標題 1/05, 1/16, 3/04, 3/10, 4/09 。
1/05: [問卦]館長嘴斷食,究竟哪邊對? [爆卦]館長:肏你媽有種現在就打過來啦
1/16: [新聞]館長促阿北2020選總統 柯P:我要再想想 [爆卦]館長正在跟柯文哲直播談虐童案
3/04: [新聞]「叫小賀!」孫安佐單挑館長影片曝光眼 [問卦]館長打得贏甄子丹嗎
3/10: [問卦]館長譙統促黨髒話本來就理虧,不是嗎? [新聞]統促黨嗆館長打一場 「簽生死狀,條件你
4/09: [新聞]與館長談統獨賴清德:統一就像斯斯有兩 [新聞]館長對決美國智庫?蔡賴今晚熱身賽

前後五個字彙

可看出常出現在「館長」附近的字。

ngram_11 <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    ngram <- ngrams(tokens, 11)
    ngram <- lapply(ngram, paste, collapse = " ")
    unlist(ngram)
  })
}
g_ngram_11 <- g_csv %>%
  select(artUrl, sentence) %>%
  unnest_tokens(ngram, sentence, token = ngram_11) %>%
  filter(!str_detect(ngram, regex("[0-9a-zA-Z]")))
g_ngrams_11_separated <- g_ngram_11 %>%
  separate(ngram, paste0("word", c(1:11),sep=""), sep = " ")
g_ngrams_11_separated
g_check_words <- g_ngrams_11_separated %>%
  filter((word6 == "館長"))
g_check_words
g_check_words_count <- g_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word=value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>%
  count(word, sort = TRUE)
g_check_words_count %>%
  arrange(desc(abs(n))) %>%
  head(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("出現在「館長」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light"))

Word Correlation

g_words_by_art <- g_csv %>%
  unnest_tokens(word, sentence, token=g_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9]"))) %>%
  count(artUrl, word, sort = TRUE)
g_word_pairs <- g_words_by_art %>%
  pairwise_count(word, artUrl, sort = TRUE)
g_word_pairs
g_word_cors <- g_words_by_art %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, artUrl, sort = TRUE)
g_word_cors %>%
  filter(item1 == "館長")

詞彙之間相關性

seed_words <- c("新聞", "綜合", "appledaily")
threshold <- 0.65
remove_words <- g_word_cors %>%
                filter((item1 %in% seed_words|item2 %in% seed_words), correlation>threshold) %>%
                .$item1 %>%
                unique()
set.seed(2017)
g_word_cors_new <- g_word_cors %>%
                filter(!(item1 %in% remove_words|item2 %in% remove_words))
g_word_cors_new %>%
  filter(correlation > .4) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
  theme_void()

分群

透過詞彙平均tf-idf,去除部分不重要的字

term_avg_tfidf <- g_words_tf_idf %>% 
  group_by(word) %>% 
  summarise(tfidf_avg = mean(tf_idf))
term_avg_tfidf$tfidf_avg %>% summary
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0001514 0.0325254 0.0521481 0.0988115 0.1138974 2.1033061 
term_remove=term_avg_tfidf %>%  
  filter(tfidf_avg<0.0325254) %>% 
  .$word
term_remove %>% head
[1] "阿嘎"   "阿共嚇" "啊災"   "挨告"   "愛民"   "愛鄉"  
g_dtm = g_words_tf_idf %>%
  filter(!word %in% term_remove) %>%
  cast_dtm(document=artTitle,term=word,value= n)
g_dtm
<<DocumentTermMatrix (documents: 550, terms: 8704)>>
Non-/sparse entries: 22402/4764798
Sparsity           : 100%
Maximal term length: 13
Weighting          : term frequency (tf)
g_dtm_matrix = g_dtm %>% as.data.frame.matrix 
g_dtm_matrix[1:10,1:20]

層級式分群

library(doParallel)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
clust = makeCluster(detectCores())
registerDoParallel(clust); getDoParWorkers()
[1] 4
t0 = Sys.time()
d = g_dtm_matrix %>%
  dist(method="euclidean")  #歐式距離,算文章與文章之間的距離
Sys.time() - t0
Time difference of 8.371428 secs
hc = hclust(d, method='ward.D')  
plot(hc, labels = FALSE, xlab = NULL)
rect.hclust(hc, k = 2, border="red")

kg = cutree(hc, k = 2)
L = split(g_dtm_matrix, kg)
L$`1`[1:10,1:10]
sapply(L, function(x) x%>% colMeans %>% sort %>% tail %>% names)
     1        2     
[1,] "賴清德" "影片"
[2,] "統促黨" "健身"
[3,] "總統"   "看到"
[4,] "完整"   "艾瑪"
[5,] "中國"   "台灣"
[6,] "台灣"   "八卦"

在二維平面圖上以文字雲分析不同群的字

# t0 = Sys.time()
# n = 2000 #n個字
# tsne = g_dtm[, 1:n] %>% as.data.frame.matrix %>%
#   scale %>% t %>% Rtsne(
#     check_duplicates = FALSE, theta=0.0, max_iter=3200)
# Sys.time()-t0
# Y = tsne$Y              # tSNE coordinates
# d_Y = dist(Y)             # distance matrix
# hc_Y = hclust(d_Y )          # hi-clustering
# plot(hc_Y,label=F)
# rect.hclust(hc_Y, k=10, border="red")
# K = 10                            # number of clusters
# g = cutree(hc_Y,K)                # cut into K clusters
# table(g) %>% as.vector %>% sort   # sizes of clusters
# wc = col_sums(g_dtm[,1:n]) #n個字
# colors = distinctColorPalette(K)
# png("./g.png", width=3200, height=1800)#輸出圖片到路徑下
# textplot(
#   Y[,1], Y[,2], colnames(g_dtm)[1:n], show=F,
#   col=colors[g],
#   cex= 0.3 + 1.25 * sqrt(wc/mean(wc)),
#   font=2, family = "Heiti TC Light")
# dev.off()

建立LDA模型

統計每篇文章詞頻

g_artid <- g_tokens %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>% 
  count(artTitle, word) %>% 
  rename(count=n) %>% 
  mutate(artId = group_indices(., artTitle))
g_artid
reserved_word <- g_artid %>% 
  group_by(word) %>% 
  count() %>% 
  filter(n > 5) %>% 
  unlist()
g_artid <- g_artid %>% 
  filter(word %in% reserved_word)

轉換為DTM

g_com_dtm <- g_artid %>% cast_dtm(artId, word, count)
g_com_dtm
<<DocumentTermMatrix (documents: 550, terms: 953)>>
Non-/sparse entries: 13381/510769
Sparsity           : 97%
Maximal term length: 6
Weighting          : term frequency (tf)

轉為分成兩群的LDA

g_lda <- LDA(g_com_dtm, k = 2, control = list(seed = 1234))

\(\phi\) Matrix

查看\(\phi\) matrix (topic * term)

g_topics <- tidy(g_lda, matrix = "beta")
g_topics

看分出來的兩個topic中,最常出現的詞

g_top_terms <- g_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
remove_words <- c("館長", "直播", "陳之漢", "台灣")
g_top_terms <- g_topics %>%
  filter(! term %in% remove_words) %>% 
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
g_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  theme(text = element_text(family = "Heiti TC Light"))

可看出在只分為兩個主題的情況下,結果並不是很明確。因為在主題一中,除了健身房本身的用詞外,還有統促黨被歸類在一起。 而主題二則明顯與政治與統獨相關。

兩主題之間相差最大的詞彙(topic2/topic1)

beta_spread <- g_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .0004 | topic2 > .0004 ) %>%
  mutate(log_ratio = log2(topic2 / topic1))
g_topic_ratio <- rbind(beta_spread %>% 
                         top_n(10,wt = log_ratio), 
                       beta_spread %>% 
                         top_n(-10, log_ratio)) %>%
  arrange(log_ratio)
g_topic_ratio %>% 
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
  geom_bar(stat="identity") + 
  xlab("Word")+
  coord_flip() +
  theme(text = element_text(family = "Heiti TC Light"))

正越大表示越傾向主題二,負越大越傾向主題一,可看出與先前的評論相同。
主題一傾向與健身房相關,主題二與政治相關。

\(\theta\) matrix (document * topic)

查看\(\theta\) matrix

g_documents <- tidy(g_lda, matrix="gamma")
g_documents

查看最被確認在兩個主題中的前十篇文章。

g_documents$document<- g_documents$document %>% as.integer()
g_documents %>% 
  group_by(topic) %>% 
  top_n(10,gamma) %>% 
  arrange(topic) %>% 
  inner_join(g_artid %>% distinct(artTitle,artId), by=c("document" = "artId")) %>% 
  select(topic, artTitle, gamma)

也可看出與先前所下的結論相近,主題一有健身房相關與統促黨,另外可看到與高雄市政府相關事件。
主題二則多為政治與統獨相關議題。

LDAvis

只分為兩個主題出來的結果並不是很明確,這裡改成分為三個主題。

topicmodels_json_ldavis <- function(fitted, doc_term){
    require(LDAvis)
    require(slam)
    phi <- as.matrix(posterior(fitted)$terms)
    theta <- as.matrix(posterior(fitted)$topics)
    vocab <- colnames(phi)
    term_freq <- slam::col_sums(doc_term)
    json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                            vocab = vocab,
                            doc.length = as.vector(table(doc_term$i)),
                            term.frequency = term_freq)
    return(json_lda)
}
g_ldavis <- LDA(g_com_dtm, k = 3, control = list(seed = 1234))
json_res <- topicmodels_json_ldavis(g_ldavis,g_com_dtm)
serVis(json_res, open.browser = T)
To stop the server, run servr::daemon_stop(1) or restart your R session
Serving the directory /private/var/folders/x1/z5l5bj0d1bx64tl585rmpz4w0000gn/T/RtmpOv7c4q/file78c6fec31fc at http://127.0.0.1:4321

與健身房相關。

與政治和統獨相關。

與統促黨相關。

在分成三個主題的設定下,將統促黨相關的詞從原先與健身房相關的詞中,獨立成一個新的主題。

分析館長熱度歷久不衰的原因

date_plot

持續對大家(鄉民)有興趣的議題發表看法。

透過直播談論政治議題或健身房相關事件。

g_csv %>% 
  select(artTitle, commentNum, push, boo) %>% 
  filter(commentNum >= 20) %>% 
  mutate(p_ratio = push/commentNum, b_ratio = boo/commentNum) %>% 
  arrange(-p_ratio)

大家對於館長政治立場相關的發言,推文的比例都是偏高的,可見很合鄉民的胃口。

結論

館長在PPT上討論度每天都維持大於1篇的文章數, 而在這些文章中,可以發現會引起鄉民討論主要是館長的直播內容。 館長在直播中大部分是在談政治議題,以及平常的時事, 由分析中可以發現,政治議題最能夠激起鄉民的討論,例如:統促黨、統一、中國、韓國瑜…等等 反倒是健身房討論度較少。

---
title: "社群媒體期中報告 - PTT八卦版：館長的討論分析"
author: "第七組 組員：陳琨翔、蔡宗諺、林意婕、王澤恩"
date: "2019/04/12"
output:
  html_notebook: default
  html_document: default
abstract: ""
---
# 動機與分析目的
> 因為最近網紅是個很熱門的職業，不管老幼婦孺，只要敢展現自我、發揮創意，人人都有機會成為網紅。我們想探討以"館長"這個人格特質很突出，又在ptt討論熱度很高的網紅，是留給人們什麼的印象，他又在討論什麼才會引起大家那麼多的關注。

# 前置作業

## 資料取得及套件載入
> 載入的資料是由中山大學管理學院文字分析平台取得，在平台資料選擇下載原始資料所取得之csv檔案。

### 資料簡介
> 本資料為2019/01/01 ~ 2019/04/12 PTT八卦版之資料，透過文字分析平台檢索「館長」、「陳之漢」兩個關鍵字，共搜尋到646篇文章。

```{r}
Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # 避免中文亂碼
```

## 安裝需要的packages
```{r}
packages = c("dplyr", "tidytext", "jiebaR", "gutenbergr", "stringr", "wordcloud2", "ggplot2", "tidyr", "scales", "widyr", "readr", "reshape2", "NLP", "ggraph", "igraph", "tm", "data.table", "quanteda", "Matrix", "slam", "Rtsne", "randomcoloR", "wordcloud", "topicmodels", "LDAvis", "webshot", "htmlwidgets","servr")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
```

```{r}
require(dplyr)
require(tidytext)
require(jiebaR)
require(gutenbergr)
require(stringr)
require(wordcloud2)
require(ggplot2)
require(tidyr)
require(scales)
require(widyr)
require(readr)
require(reshape2)
require(NLP)
require(ggraph)
require(igraph)
require(tm)
require(data.table)
require(quanteda)
require(Matrix)
require(slam)
require(Rtsne)
require(randomcoloR)
require(wordcloud)
require(topicmodels)
require(LDAvis)
require(webshot)
require(htmlwidgets)
require(servr)
```

## 資料載入
```{r}
g_csv <- fread("guan_jang_data.csv", encoding = "UTF-8", header = TRUE)
g_csv <- g_csv %>% 
  filter(artUrl != "https://www.ptt.cc/bbs/Gossiping/M.1547888391.A.836.html")
g_csv$artDate = g_csv$artDate %>% as.Date("%Y/%m/%d")
```

## 預覽資料
```{r}
head(g_csv)
```

# 日期折線圖
> 計算出每一天文章的發表數量，看出討論「館長」的熱度。

## 資料處理
```{r}
g_date <- g_csv %>% 
  select(artDate, artUrl) %>% 
  distinct()

article_count_by_date <- g_date %>% 
  group_by(artDate) %>% 
  summarise(count = n())

article_count_by_date %>% 
  arrange(desc(count))%>% 
  top_n(10)
```
> 討論篇數最多的前10天。

```{r}
date_plot <- article_count_by_date %>% 
  ggplot(aes(x = artDate, y = count)) +
  geom_line(color = "purple", size = 1.5) + 
  geom_vline(xintercept = c(as.numeric(as.Date("2019-01-05")),
                            as.numeric(as.Date("2019-01-16")),
                            as.numeric(as.Date("2019-03-04")),
                            as.numeric(as.Date("2019-03-10")), 
                            as.numeric(as.Date("2019-04-09"))), col='red', size = 1) + 
  scale_x_date(labels = date_format("%Y/%m/%d")) +
  ggtitle("「館長」討論文章數") + 
  xlab("日期") + 
  ylab("數量") +
  theme(text = element_text(family = "Heiti TC Light"))
date_plot
```

> 從上圖中可以看到關於「館長」的討論在1/05, 1/16, 3/04, 3/10, 4/09 出現高點。<br>
> 1/05: [問卦]館長嘴斷食，究竟哪邊對？               [爆卦]館長:肏你媽有種現在就打過來啦<br>
> 1/16: [新聞]館長促阿北2020選總統　柯P:我要再想想   [爆卦]館長正在跟柯文哲直播談虐童案<br>
> 3/04: [新聞]「叫小賀！」孫安佐單挑館長影片曝光眼   [問卦]館長打得贏甄子丹嗎<br>
> 3/07: 館長:統促黨3/10號要來林口館罵三字經!!!<br>
> 3/10: [問卦]館長譙統促黨髒話本來就理虧，不是嗎？   [新聞]統促黨嗆館長打一場　「簽生死狀，條件你<br>
> 4/09: [新聞]與館長談統獨賴清德：統一就像斯斯有兩   [新聞]館長對決美國智庫？蔡賴今晚熱身賽<br>

# 文字雲
> 接下來我們來大略觀察討論的內容為何，使用的方式為文字雲。

## 斷詞、停用詞使用
```{r}
jieba_tokenizer <- worker(user="k_dict.txt", stop_word = "stop_words.txt")

g_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    tokens <- tokens[nchar(tokens)>1]
    return(tokens)
  })
}
```

```{r}
g_tokens <- g_csv %>% 
  unnest_tokens(word, sentence, token=g_tokenizer) %>% 
  select(-artTime, -artUrl)

g_tokens_count <- g_tokens %>% 
  group_by(word) %>% 
  summarise(sum = n()) %>% 
  arrange(desc(sum))

head(g_tokens_count)
```
> 結果為詞出現最多的字。

## 移除「館長」
```{r}
wordc_plot <- g_tokens_count %>% 
  filter(word != "館長") %>% 
  filter(sum > 20) %>% 
  wordcloud2()
wordc_plot
```

# 長條圖
> 長條圖可以查看精確的「最常出現詞彙」。

## 查看篇數最多的五天中，最常出現的詞彙。
```{r}
g_tokens_by_date <- g_tokens %>% 
  count(artDate, word, sort = TRUE)
  
plot_merge <- g_tokens_by_date %>% 
  filter(word != "館長" & word != "直播") %>% 
  filter(artDate == as.Date("2019-01-05") | 
         artDate == as.Date("2019-01-16") | 
         artDate == as.Date("2019-03-04") |
         artDate == as.Date("2019-03-10") | 
         artDate == as.Date("2019-04-09")) %>% 
  group_by(artDate) %>% 
  top_n(7, n) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x=word, y=n, fill = artDate)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = NULL) +
  facet_wrap(~artDate, scales="free", ncol = 2) + 
  coord_flip()+
  theme(text = element_text(family = "Heiti TC Light"))
plot_merge
```

> 前面通過篇數找出的文章篇數高點的標題 1/05, 1/16, 3/04, 3/10, 4/09 。<br>
> 1/05: [問卦]館長嘴斷食，究竟哪邊對？               [爆卦]館長:肏你媽有種現在就打過來啦<br>
> 1/16: [新聞]館長促阿北2020選總統　柯P:我要再想想   [爆卦]館長正在跟柯文哲直播談虐童案<br>
> 3/04: [新聞]「叫小賀！」孫安佐單挑館長影片曝光眼   [問卦]館長打得贏甄子丹嗎<br>
> 3/10: [問卦]館長譙統促黨髒話本來就理虧，不是嗎？   [新聞]統促黨嗆館長打一場　「簽生死狀，條件你<br>
> 4/09: [新聞]與館長談統獨賴清德：統一就像斯斯有兩   [新聞]館長對決美國智庫？蔡賴今晚熱身賽<br>

# 計算tf-idf

## 以文章區格document
```{r}
g_tokens_by_art <- g_tokens %>% 
  filter(!str_detect(word, regex("[0-9]"))) %>%
  count(artTitle, word, sort = TRUE)
g_total_words_by_art <- g_tokens_by_art %>% 
  group_by(artTitle) %>% 
  summarize(total = sum(n)) %>% 
  arrange(desc(total))
g_tokens_by_art <- left_join(g_tokens_by_art, g_total_words_by_art)
```

### 過濾掉文章長度少於20個詞的
```{r}
g_words_tf_idf <- g_tokens_by_art %>%
  bind_tf_idf(word, artTitle, n) 
g_words_tf_idf %>% 
  filter(total > 20) %>% 
  arrange(desc(tf_idf))
```

> 文章本文：<br>
背心尊者有強大的背心能力 <br>
平常館長也很常穿背心 <br>
不過我觀察館長穿的是寬鬆類背心 <br>
背心尊者穿的是緊身類背心 <br>
雖然是不同背心 <br>
但同樣都是背心愛好者 <br>
手下的教練 也是有背心軍團的資格 <br>
館長根本是真人版的背心尊者吧 <br>
大家覺得呢

### 文章總長度大於100個詞
```{r}
g_words_tf_idf %>% 
  filter(total > 100) %>% 
  arrange(desc(tf_idf))
```

## 用日期來區隔document
```{r}
g_tokens_by_date <- g_tokens %>% 
  filter(!str_detect(word, regex("[0-9]"))) %>%
  count(artDate, word, sort = TRUE)
g_total_words_by_date <- g_tokens_by_date %>% 
  group_by(artDate) %>% 
  summarize(total = sum(n)) %>% 
  arrange(desc(total))
g_tokens_by_date <- left_join(g_tokens_by_date, g_total_words_by_date)
g_words_tf_idf_date <- g_tokens_by_date %>%
  bind_tf_idf(word, artDate, n) 
g_words_tf_idf_date %>% 
  filter(total > 20) %>% 
  group_by(artDate) %>% 
  top_n(1) %>% 
  arrange(artDate)
```

### 找出前面五個日期篇數高點
```{r}
g_words_tf_idf_date %>% 
  filter(total > 20) %>% 
  filter(artDate == as.Date("2019-01-05") | 
         artDate == as.Date("2019-01-16") | 
         artDate == as.Date("2019-03-04") |
         artDate == as.Date("2019-03-10") | 
         artDate == as.Date("2019-04-09")) %>% 
  group_by(artDate) %>%  
  top_n(1) %>% 
  arrange(artDate)
```

> 前面通過篇數找出的文章篇數高點的標題 1/05, 1/16, 3/04, 3/10, 4/09 。<br>
> 1/05: [問卦]館長嘴斷食，究竟哪邊對？               [爆卦]館長:肏你媽有種現在就打過來啦<br>
> 1/16: [新聞]館長促阿北2020選總統　柯P:我要再想想   [爆卦]館長正在跟柯文哲直播談虐童案<br>
> 3/04: [新聞]「叫小賀！」孫安佐單挑館長影片曝光眼   [問卦]館長打得贏甄子丹嗎<br>
> 3/10: [問卦]館長譙統促黨髒話本來就理虧，不是嗎？   [新聞]統促黨嗆館長打一場　「簽生死狀，條件你<br>
> 4/09: [新聞]與館長談統獨賴清德：統一就像斯斯有兩   [新聞]館長對決美國智庫？蔡賴今晚熱身賽<br>

# 前後五個字彙
> 可看出常出現在「館長」附近的字。

```{r}
ngram_11 <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    ngram <- ngrams(tokens, 11)
    ngram <- lapply(ngram, paste, collapse = " ")
    unlist(ngram)
  })
}
```

```{r}
g_ngram_11 <- g_csv %>%
  select(artUrl, sentence) %>%
  unnest_tokens(ngram, sentence, token = ngram_11) %>%
  filter(!str_detect(ngram, regex("[0-9a-zA-Z]")))
g_ngrams_11_separated <- g_ngram_11 %>%
  separate(ngram, paste0("word", c(1:11),sep=""), sep = " ")
g_ngrams_11_separated
```

```{r}
g_check_words <- g_ngrams_11_separated %>%
  filter((word6 == "館長"))
g_check_words
```

```{r}
g_check_words_count <- g_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word=value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>%
  count(word, sort = TRUE)

g_check_words_count %>%
  arrange(desc(abs(n))) %>%
  head(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("出現在「館長」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light"))
```

# Word Correlation
```{r}
g_words_by_art <- g_csv %>%
  unnest_tokens(word, sentence, token=g_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9]"))) %>%
  count(artUrl, word, sort = TRUE)
g_word_pairs <- g_words_by_art %>%
  pairwise_count(word, artUrl, sort = TRUE)
g_word_pairs
```

```{r}
g_word_cors <- g_words_by_art %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, artUrl, sort = TRUE)
g_word_cors %>%
  filter(item1 == "館長")
```

## 詞彙之間相關性
```{r}
seed_words <- c("新聞", "綜合", "appledaily")
threshold <- 0.65
remove_words <- g_word_cors %>%
                filter((item1 %in% seed_words|item2 %in% seed_words), correlation>threshold) %>%
                .$item1 %>%
                unique()

set.seed(2017)
g_word_cors_new <- g_word_cors %>%
                filter(!(item1 %in% remove_words|item2 %in% remove_words))
g_word_cors_new %>%
  filter(correlation > .4) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
  theme_void()
```

# 分群

## 透過詞彙平均tf-idf，去除部分不重要的字
```{r}
term_avg_tfidf <- g_words_tf_idf %>% 
  group_by(word) %>% 
  summarise(tfidf_avg = mean(tf_idf))
term_avg_tfidf$tfidf_avg %>% summary
```

```{r}
term_remove=term_avg_tfidf %>%  
  filter(tfidf_avg<0.0325254) %>% 
  .$word
term_remove %>% head
```

```{r}
g_dtm = g_words_tf_idf %>%
  filter(!word %in% term_remove) %>%
  cast_dtm(document=artTitle,term=word,value= n)
g_dtm
```

```{r}
g_dtm_matrix = g_dtm %>% as.data.frame.matrix 
g_dtm_matrix[1:10,1:20]
```

# 層級式分群
```{r}
library(doParallel)
clust = makeCluster(detectCores())
registerDoParallel(clust); getDoParWorkers()
```

```{r}
t0 = Sys.time()
d = g_dtm_matrix %>%
  dist(method="euclidean")  #歐式距離，算文章與文章之間的距離
Sys.time() - t0
```

```{r}
hc = hclust(d, method='ward.D')  
plot(hc, labels = FALSE, xlab = NULL)
rect.hclust(hc, k = 2, border="red")
```

```{r}
kg = cutree(hc, k = 2)
L = split(g_dtm_matrix, kg)
L$`1`[1:10,1:10]
```

```{r}
sapply(L, function(x) x%>% colMeans %>% sort %>% tail %>% names)
```

# 在二維平面圖上以文字雲分析不同群的字
```{r}
# t0 = Sys.time()
# n = 2000 #n個字
# tsne = g_dtm[, 1:n] %>% as.data.frame.matrix %>%
#   scale %>% t %>% Rtsne(
#     check_duplicates = FALSE, theta=0.0, max_iter=3200)
# Sys.time()-t0
```

```{r}
# Y = tsne$Y              # tSNE coordinates
# d_Y = dist(Y)             # distance matrix
# hc_Y = hclust(d_Y )          # hi-clustering
# plot(hc_Y,label=F)
# rect.hclust(hc_Y, k=10, border="red")
```

```{r}
# K = 10                            # number of clusters
# g = cutree(hc_Y,K)                # cut into K clusters
# table(g) %>% as.vector %>% sort   # sizes of clusters
```

```{r}
# wc = col_sums(g_dtm[,1:n]) #n個字
# colors = distinctColorPalette(K)
# png("./g.png", width=3200, height=1800)#輸出圖片到路徑下
# textplot(
#   Y[,1], Y[,2], colnames(g_dtm)[1:n], show=F,
#   col=colors[g],
#   cex= 0.3 + 1.25 * sqrt(wc/mean(wc)),
#   font=2, family = "Heiti TC Light")
# dev.off()
```

![](g.png)

# 建立LDA模型

## 統計每篇文章詞頻
```{r}
g_artid <- g_tokens %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>% 
  count(artTitle, word) %>% 
  rename(count=n) %>% 
  mutate(artId = group_indices(., artTitle))
g_artid
```

```{r}
reserved_word <- g_artid %>% 
  group_by(word) %>% 
  count() %>% 
  filter(n > 5) %>% 
  unlist()
g_artid <- g_artid %>% 
  filter(word %in% reserved_word)
```

## 轉換為DTM
```{r}
g_com_dtm <- g_artid %>% cast_dtm(artId, word, count)
g_com_dtm
```

## 轉為分成兩群的LDA
```{r}
g_lda <- LDA(g_com_dtm, k = 2, control = list(seed = 1234))
```

## $\phi$ Matrix

### 查看$\phi$ matrix (topic * term)

```{r}
g_topics <- tidy(g_lda, matrix = "beta")
g_topics
```

### 看分出來的兩個topic中，最常出現的詞
```{r}
g_top_terms <- g_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

remove_words <- c("館長", "直播", "陳之漢", "台灣")

g_top_terms <- g_topics %>%
  filter(! term %in% remove_words) %>% 
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

g_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  theme(text = element_text(family = "Heiti TC Light"))
```
> 可看出在只分為兩個主題的情況下，結果並不是很明確。因為在主題一中，除了健身房本身的用詞外，還有統促黨被歸類在一起。
> 而主題二則明顯與政治與統獨相關。

### 兩主題之間相差最大的詞彙(topic2/topic1)
```{r}
beta_spread <- g_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .0004 | topic2 > .0004 ) %>%
  mutate(log_ratio = log2(topic2 / topic1))

g_topic_ratio <- rbind(beta_spread %>% 
                         top_n(10,wt = log_ratio), 
                       beta_spread %>% 
                         top_n(-10, log_ratio)) %>%
  arrange(log_ratio)

g_topic_ratio %>% 
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
  geom_bar(stat="identity") + 
  xlab("Word")+
  coord_flip() +
  theme(text = element_text(family = "Heiti TC Light"))
```
> 正越大表示越傾向主題二，負越大越傾向主題一，可看出與先前的評論相同。<br>
> 主題一傾向與健身房相關，主題二與政治相關。

## $\theta$ matrix  (document * topic)

### 查看$\theta$ matrix
```{r}
g_documents <- tidy(g_lda, matrix="gamma")
g_documents
```

### 查看最被確認在兩個主題中的前十篇文章。
```{r}
g_documents$document<- g_documents$document %>% as.integer()
g_documents %>% 
  group_by(topic) %>% 
  top_n(10,gamma) %>% 
  arrange(topic) %>% 
  inner_join(g_artid %>% distinct(artTitle,artId), by=c("document" = "artId")) %>% 
  select(topic, artTitle, gamma)
```

> 也可看出與先前所下的結論相近，主題一有健身房相關與統促黨，另外可看到與高雄市政府相關事件。<br>
> 主題二則多為政治與統獨相關議題。

# LDAvis
> 只分為兩個主題出來的結果並不是很明確，這裡改成分為三個主題。

```{r}
topicmodels_json_ldavis <- function(fitted, doc_term){
    require(LDAvis)
    require(slam)

    phi <- as.matrix(posterior(fitted)$terms)
    theta <- as.matrix(posterior(fitted)$topics)
    vocab <- colnames(phi)
    term_freq <- slam::col_sums(doc_term)

    json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                            vocab = vocab,
                            doc.length = as.vector(table(doc_term$i)),
                            term.frequency = term_freq)
    return(json_lda)
}
```

```{r}
g_ldavis <- LDA(g_com_dtm, k = 3, control = list(seed = 1234))

json_res <- topicmodels_json_ldavis(g_ldavis,g_com_dtm)

serVis(json_res, open.browser = T)
```
![](topic1.png)

> 與健身房相關。

![](topic2.png)

> 與政治和統獨相關。

![](topic3.png)

> 與統促黨相關。

> 在分成三個主題的設定下，將統促黨相關的詞從原先與健身房相關的詞中，獨立成一個新的主題。

# 分析館長熱度歷久不衰的原因

```{r}
date_plot
```
> 持續對大家(鄉民)有興趣的議題發表看法。

![](wordcloud.png)

> 透過直播談論政治議題或健身房相關事件。

```{r}
g_csv %>% 
  select(artTitle, commentNum, push, boo) %>% 
  filter(commentNum >= 20) %>% 
  mutate(p_ratio = push/commentNum, b_ratio = boo/commentNum) %>% 
  arrange(-p_ratio)
```
> 大家對於館長政治立場相關的發言，推文的比例都是偏高的，可見很合鄉民的胃口。

# 結論
> 館長在PPT上討論度每天都維持大於1篇的文章數，
> 而在這些文章中，可以發現會引起鄉民討論主要是館長的直播內容。
> 館長在直播中大部分是在談政治議題，以及平常的時事，
> 由分析中可以發現，政治議題最能夠激起鄉民的討論，例如:統促黨、統一、中國、韓國瑜...等等
> 反倒是健身房討論度較少。




