Motivation and goals: In response to the recent COVID-19 outbreak, the government announced a stimulus-voucher (振興券) policy, but coverage in the news and discussion on PTT have been largely negative. Former president Ma Ying-jeou's administration also issued consumption vouchers (消費券), so we use these posts to understand how the public views both policies.

System settings

Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # avoid garbled Chinese text
## [1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8"

Install the required packages

packages = c("dplyr", "tidytext", "jiebaR", "gutenbergr", "stringr", "wordcloud2", "ggplot2", "tidyr", "scales", "widyr", "readr", "reshape2", "NLP", "ggraph", "igraph", "tm", "data.table", "quanteda", "Matrix", "slam", "Rtsne", "randomcoloR", "wordcloud", "topicmodels", "LDAvis", "webshot", "htmlwidgets", "servr")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
require(dplyr)
require(tidytext)
require(jiebaR)
require(gutenbergr)
require(stringr)
require(wordcloud2)
require(ggplot2)
require(tidyr)
require(scales)
require(widyr)
require(readr)
require(reshape2)
require(NLP)
require(ggraph)
require(igraph)
require(tm)
require(data.table)
require(quanteda)
require(Matrix)
require(slam)
require(Rtsne)
require(randomcoloR)
require(wordcloud)
require(topicmodels)
require(LDAvis)
require(webshot)
require(htmlwidgets)
require(servr)

Loading the data: the corpus contains PTT Gossiping posts from 2009/01/01 to 2020/06/10, retrieved through a text-analysis platform using keywords such as 「消費券」, 「振興券」, and 「酷碰券」.

m <- read_csv('./data/振興券_artWordFreq.csv')
## Parsed with column specification:
## cols(
##   artTitle = col_character(),
##   artDate = col_date(format = ""),
##   artTime = col_time(format = ""),
##   artUrl = col_character(),
##   word = col_character(),
##   count = col_double()
## )
mr <- read_csv('./data/振興券_articleReviews.csv')
## Parsed with column specification:
## cols(
##   artTitle = col_character(),
##   artDate = col_date(format = ""),
##   artTime = col_time(format = ""),
##   artUrl = col_character(),
##   artPoster = col_character(),
##   artCat = col_character(),
##   cmtPoster = col_character(),
##   cmtStatus = col_character(),
##   cmtDate = col_datetime(format = ""),
##   cmtContent = col_character()
## )
mr$artDate = mr$artDate %>% as.Date("%Y/%m/%d")

Preview the data

head(mr)
## # A tibble: 6 x 10
##   artTitle artDate    artTime artUrl artPoster artCat cmtPoster cmtStatus
##   <chr>    <date>     <time>  <chr>  <chr>     <chr>  <chr>     <chr>    
## 1 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… wensandra 推       
## 2 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… js52666   →        
## 3 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… jffry6663 →        
## 4 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… js52666   →        
## 5 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… wensandra →        
## 6 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… ah937609  →        
## # … with 2 more variables: cmtDate <dttm>, cmtContent <chr>

Tokenization and stop words

jieba_tokenizer <- worker(user="user_dict.txt", stop_word = "stop_words.txt")
tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    tokens <- tokens[nchar(tokens)>1]
    return(tokens)
  })
}
n_tokens <- mr %>% unnest_tokens(word, cmtContent, token=tokenizer)
str(n_tokens)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 47464 obs. of  10 variables:
##  $ artTitle : chr  "[問卦]有沒有當年消費券怎麼用的八卦" "[問卦]有沒有當年消費券怎麼用的八卦" "[問卦]有沒有當年消費券怎麼用的八卦" "[問卦]有沒有當年消費券怎麼用的八卦" ...
##  $ artDate  : Date, format: "2017-01-17" "2017-01-17" ...
##  $ artTime  : 'hms' num  17:37:27 17:37:27 17:37:27 17:37:27 ...
##   ..- attr(*, "units")= chr "secs"
##  $ artUrl   : chr  "https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html" "https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html" "https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html" "https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html" ...
##  $ artPoster: chr  "kmjhome" "kmjhome" "kmjhome" "kmjhome" ...
##  $ artCat   : chr  "Gossiping" "Gossiping" "Gossiping" "Gossiping" ...
##  $ cmtPoster: chr  "wensandra" "wensandra" "wensandra" "js52666" ...
##  $ cmtStatus: chr  "推" "推" "推" "→" ...
##  $ cmtDate  : POSIXct, format: "2017-01-18 01:37:00" "2017-01-18 01:37:00" ...
##  $ word     : chr  "之前" "一次" "剛剛" "一樓" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   artTitle = col_character(),
##   ..   artDate = col_date(format = ""),
##   ..   artTime = col_time(format = ""),
##   ..   artUrl = col_character(),
##   ..   artPoster = col_character(),
##   ..   artCat = col_character(),
##   ..   cmtPoster = col_character(),
##   ..   cmtStatus = col_character(),
##   ..   cmtDate = col_datetime(format = ""),
##   ..   cmtContent = col_character()
##   .. )
head(n_tokens, 20)
## # A tibble: 20 x 10
##    artTitle artDate    artTime artUrl artPoster artCat cmtPoster cmtStatus
##    <chr>    <date>     <time>  <chr>  <chr>     <chr>  <chr>     <chr>    
##  1 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… wensandra 推       
##  2 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… wensandra 推       
##  3 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… wensandra 推       
##  4 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… js52666   →        
##  5 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… jffry6663 →        
##  6 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… js52666   →        
##  7 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… js52666   →        
##  8 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… js52666   →        
##  9 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… ah937609  →        
## 10 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… Dooo      →        
## 11 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… Dooo      →        
## 12 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… Dooo      →        
## 13 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… nestea91… 推       
## 14 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… nestea91… 推       
## 15 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… nestea91… 推       
## 16 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… GTA0328   推       
## 17 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… GTA0328   推       
## 18 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… GTA0328   推       
## 19 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… eric999   推       
## 20 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… eric999   推       
## # … with 2 more variables: cmtDate <dttm>, word <chr>

Word frequencies

# count each word's occurrences; single-character words are excluded (the tokenizer already drops them)
tokens_count_by_date <- n_tokens %>%
  #filter(nchar(.$word)>1) %>%
  group_by(word, artDate) %>%
  #group_by(word) %>%
  summarise(sum = n()) %>%
  filter(sum>3) %>%
  arrange(desc(sum))
head(tokens_count_by_date)
## # A tibble: 6 x 3
## # Groups:   word [4]
##   word   artDate      sum
##   <chr>  <date>     <int>
## 1 消費券 2020-05-28   130
## 2 真的   2020-05-28   116
## 3 馬英九 2020-05-28    96
## 4 一家親 2020-04-17    84
## 5 消費券 2020-04-08    77
## 6 消費券 2020-04-04    68

Sentiment analysis

Preparing the LIWC dictionary

We use the LIWC dictionary to decide whether each word in the corpus is positive or negative.

# positive-word dictionary (txt)
# entries are comma-separated
P <- read_file("dict/liwc/positive.txt")

# negative-word dictionary (txt)
N <- read_file("dict/liwc/negative.txt")
# each dictionary file is read in as one long string
typeof(N)
## [1] "character"
# split the string on commas
# strsplit returns a list; take its first element
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]

# build a data frame with columns word and sentiment; word holds the dictionary vector
P = data.frame(word = P, sentiment = "positive")
N = data.frame(word = N, sentiment = "negative")
LIWC = rbind(P, N)

Joining with the LIWC sentiment dictionary

For each corpus word that appears in the LIWC dictionary: is it positive or negative?

tokens_count_by_date %>% inner_join(LIWC)
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## # A tibble: 150 x 4
## # Groups:   word [55]
##    word  artDate      sum sentiment
##    <chr> <date>     <int> <fct>    
##  1 幫助  2020-05-29    37 positive 
##  2 可憐  2020-05-28    33 negative 
##  3 垃圾  2020-05-28    25 negative 
##  4 失敗  2020-05-28    25 negative 
##  5 垃圾  2017-07-25    23 negative 
##  6 垃圾  2020-05-29    18 negative 
##  7 垃圾  2020-04-04    16 negative 
##  8 問題  2020-05-29    16 negative 
##  9 智障  2020-05-28    16 negative 
## 10 簡單  2020-05-28    15 positive 
## # … with 140 more rows

Total frequency of each word in the corpus

word_count <- n_tokens %>%
  #select(word) %>% 
  group_by(word) %>% 
  summarise(count = n())  %>%
  filter(count>3) %>%  # drop words that appear too rarely
  arrange(desc(count))
head(word_count, 100)
## # A tibble: 100 x 2
##    word   count
##    <chr>  <int>
##  1 消費券   945
##  2 消費     412
##  3 政府     402
##  4 真的     376
##  5 台灣     345
##  6 消費卷   323
##  7 政策     261
##  8 馬英九   256
##  9 經濟     245
## 10 垃圾     243
## # … with 90 more rows

Because the data were retrieved with keywords such as 消費券 and 消費卷, those terms are artificially frequent and must be removed.
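This eight-call filter() chain recurs throughout the report, so a small helper vector keeps it consistent (a minimal sketch; query_terms and word_count_clean are our own names, not from the original code):

query_terms <- c("消費卷", "消費券", "振興卷", "振興券",
                 "武漢肺炎", "武漢病毒", "酷碰券", "新冠肺炎")
# one filter() call instead of eight
word_count_clean <- word_count %>% filter(!word %in% query_terms)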

Word cloud

word_count %>% 
   filter(word !="消費卷") %>%
   filter(word !="消費券") %>%
   filter(word !="振興卷") %>%
   filter(word !="振興券") %>%
   filter(word !="武漢肺炎") %>%
   filter(word !="武漢病毒") %>%
   filter(word !="酷碰券") %>%
   filter(word !="新冠肺炎") %>%
   filter(count>30) %>%
   wordcloud2()

Word cloud after removing the search keywords: terms such as 3600, 1000, 2000, and 經濟 (economy) are mentioned most often, all closely tied to the voucher discussion.

Bar chart

plot_merge <- word_count %>% 
  filter(word !="消費卷") %>%
   filter(word !="消費券") %>%
   filter(word !="振興卷") %>%
   filter(word !="振興券") %>%
   filter(word !="武漢肺炎") %>%
   filter(word !="武漢病毒") %>%
   filter(word !="酷碰券") %>%
   filter(word !="新冠肺炎") %>%
  #group_by(type) %>% 
  top_n(30, count) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, count)) %>%
  ggplot(aes(x=word, y=count)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y="詞頻") +
  theme(text=element_text(size=14, family = "Heiti TC Light"))+
  #facet_wrap(~type, ncol = 1, scales="free") + 
  coord_flip()

plot_merge

Our data span 2009 to 2020, so some of these words come from discussions of the Ma administration's consumption vouchers.

Monthly post histogram

m_post <- m
m_date <- m %>%
  filter(word !="消費卷") %>%
   filter(word !="消費券") %>%
   filter(word !="振興卷") %>%
   filter(word !="振興券") %>%
   filter(word !="武漢肺炎") %>%
   filter(word !="武漢病毒") %>%
   filter(word !="酷碰券") %>%
   filter(word !="新冠肺炎") %>%
  select(artDate, artUrl, word) %>%
  distinct()
m_article_count_by_date <- m_date %>%
  group_by(artDate) %>%
  summarise(count = n())
m_post <- m_post %>% mutate(ym = format(artDate, format = "%Y/%m"))   # extract year/month

m_months <- m_post %>%
  select(artUrl, ym) %>%
  distinct()
m_article_count_by_month <- m_months %>%
  group_by(ym) %>%
  summarise(count = n())
m_months_plot <- m_months %>%
  ggplot(aes(x = ym)) +
  geom_bar(color = "pink", size = 0.5) +
  ggtitle("振興券討論文章數(月份)") +
  geom_text(aes(label=..count..), stat="count", color = "black", vjust = -0.5, size = 3) +
  theme(text = element_text(family = "Heiti TC Light"))+
  xlab("y/m") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) 
m_months_plot

The most common words in the three months with the most posts

m_tokens_by_month <- m_post %>%
  count(ym, word, sort = T)


# word frequencies for the three months with the most posts (2018/02, 2020/04, 2020/05)
m_19 <- m_tokens_by_month %>%
  filter(ym == "2018/02"| ym == "2020/04" | ym == "2020/05") %>%
  group_by(ym) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(show.legend = FALSE) +
  theme(text = element_text(family = "Heiti TC Light"))+
  facet_wrap(~ym, scales="free", ncol = 2) + 
  coord_flip()
m_19

2018/02: 花蓮 (Hualien) appears because many users said they would spend their vouchers on a trip to Hualien, and some also hoped vouchers would be issued to boost Hualien's tourism.

2020/04: mentions of the economy and the pandemic.

2020/05: mentions of Ma Ying-jeou and the policy's effectiveness.

The discussions in the two 2020 months are largely similar.

m_comment <- n_tokens
m_comment <- m_comment %>%
  filter(word !="消費卷") %>%
   filter(word !="消費券") %>%
   filter(word !="振興卷") %>%
   filter(word !="振興券") %>%
   filter(word !="武漢肺炎") %>%
   filter(word !="武漢病毒") %>%
   filter(word !="酷碰券") %>%
   filter(word !="新冠肺炎") %>%
  mutate(ym = format(artDate, format = "%Y/%m"))
m_ctokens_by_month <- m_comment %>%
  count(ym, word, sort = T)

Monthly comment histogram

m_cmonths <- m_comment %>%
  select(ym, artUrl) 

m_cmonths_plot <- m_cmonths %>%
  ggplot(aes(x = ym)) +
  geom_bar(color = "pink", size = 0.5) +
  ggtitle("留言數(月份)") +
  geom_text(aes(label=..count..), stat="count", color = "black", vjust = -0.5, size = 3) +
  theme(text = element_text(family = "Heiti TC Light"))+
  xlab("y/m") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) 
m_cmonths_plot

# word frequencies for the months with the most articles
m_ctokens_by_month <- m_comment %>%
  count(word, ym, sort = T)

m_c1907 <- m_ctokens_by_month %>%
  filter(ym == "2018/02" | ym == "2020/04" | ym == "2020/05") %>%
  group_by(ym) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(show.legend = FALSE) +
  theme(text = element_text(family = "Heiti TC Light"))+
  facet_wrap(~ym, scales="free", ncol = 2) + 
  coord_flip()
m_c1907

2018/02: the comments say little about helping with a pandemic.

2020/04: the economy and China (pandemic-related).

2020/05: Ma Ying-jeou, and comparisons with the earlier policy's effect.

Sentiment analysis of comments

m_comment <- m_comment %>% inner_join(LIWC)
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
m_c_count <- m_comment %>% select(sentiment, word, artDate) %>%
  group_by(word) %>%
  summarise(count = n()) # per-word frequencies
m_comment <- m_comment %>% left_join(m_c_count) # merge the counts back into m_comment
## Joining, by = "word"
m_comment <- m_comment %>% mutate(year = format(artDate, format = "%Y"))

m_comment %>%
  filter(year == "2016" | year == "2017" | year == "2018" | year == "2019" | year == "2020" ) %>%
  ggplot() +
  geom_line(aes(x = artDate, y = count, colour = sentiment)) +
  scale_x_date(labels =date_format("%m/%d")) +
  facet_wrap(~year, scales="free", ncol = 2)

Counting each day's positive and negative words

sentiment_count = tokens_count_by_date %>%
  select(artDate,word,sum) %>%
  inner_join(LIWC) %>% 
  group_by(artDate,sentiment) %>%
  summarise(count=sum(sum))
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
sentiment_count %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%y/%m/%d")) 

sentiment_count[order(-sentiment_count$count),] %>%
  top_n(10)
## Selecting by count
## # A tibble: 50 x 3
## # Groups:   artDate [38]
##    artDate    sentiment count
##    <date>     <fct>     <int>
##  1 2020-05-28 negative    214
##  2 2020-05-29 negative    106
##  3 2020-05-28 positive     98
##  4 2020-05-29 positive     91
##  5 2020-04-04 negative     69
##  6 2020-04-08 negative     37
##  7 2017-07-25 negative     33
##  8 2018-02-17 negative     27
##  9 2017-01-18 negative     24
## 10 2019-01-01 negative     22
## # … with 40 more rows
sentiment_count %>%
  filter(sentiment == "positive") %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%y/%m/%d")) 

sentiment_count %>%
  filter(sentiment == "negative") %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment), color = "#00CACA")+
  scale_x_date(labels = date_format("%y/%m/%d")) 

The days with the highest negative sentiment are 2020-05-28 and 2020-05-29.

Positive sentiment also peaks on 2020-05-28 and 2020-05-29, so we analyze those days as well.
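The peak days can also be pulled out programmatically rather than read off the chart (a sketch):

# top negative days; per the table above: 2020-05-28, 2020-05-29, then 2020-04-04
sentiment_count %>%
  ungroup() %>%
  filter(sentiment == "negative") %>%
  arrange(desc(count)) %>%
  head(3)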

sentiment_count %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%y/%m/%d"))+
  geom_vline(xintercept = as.numeric(as.Date('2020-05-28')), colour = "red", size = 0.5) +
  geom_vline(xintercept = as.numeric(as.Date('2020-05-29')), colour = "red", size = 0.5)

Using the sentiment spikes to revisit events: the negative-word peak falls on '2020-05-28'.

m %>% filter(artDate == as.Date('2020-05-28')) %>% distinct(artUrl, .keep_all = TRUE)
## # A tibble: 7 x 6
##   artTitle             artDate    artTime artUrl                word  count
##   <chr>                <date>     <time>  <chr>                 <chr> <dbl>
## 1 [新聞]振興券錢少又繁複馬:別因消費券… 2020-05-28 08:35   https://www.ptt.cc/b… 馬英九…    13
## 2 Re:[新聞]振興券錢少又繁複馬:別因… 2020-05-28 08:44   https://www.ptt.cc/b… 明明      1
## 3 Re:[新聞]振興券錢少又繁複馬:別因… 2020-05-28 08:57   https://www.ptt.cc/b… 消費      8
## 4 Re:[新聞]振興券錢少又繁複馬:別因… 2020-05-28 09:09   https://www.ptt.cc/b… 消費      2
## 5 Re:[新聞]振興券錢少又繁複馬:別因… 2020-05-28 09:12   https://www.ptt.cc/b… 消費      3
## 6 Re:[新聞]振興券錢少又繁複馬:別因… 2020-05-28 11:57   https://www.ptt.cc/b… 優惠券…     4
## 7 Re:[新聞]振興券錢少又繁複馬:別因… 2020-05-28 13:02   https://www.ptt.cc/b… 消費      7
 # tokens_count_by_date %>%
 #   filter(artDate == as.Date('2020-05-28')) %>%
 #   select(word,sum) %>%
 #   group_by(word) %>%
 #   summarise(count = sum(sum))  %>%
 #   filter(count>20) %>%   # drop words that appear too rarely
 #   wordcloud2()

n_tokens %>% 
  filter(artDate == as.Date('2020-05-28')) %>%
  inner_join(LIWC) %>% 
  filter(sentiment == "negative") %>% 
  group_by(artUrl,sentiment) %>% 
  summarise(
    artTitle = artTitle[1],
    count = n()
  ) %>% 
  arrange(desc(count))
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
n_tokens %>%
  filter(artDate == as.Date('2020-05-28')) %>% 
  inner_join(LIWC) %>%
  group_by(word,sentiment) %>%
  summarise(
    count = n()
  ) %>% data.frame() %>% 
  top_n(30,wt = count) %>%
  ungroup() %>% 
  mutate(word = reorder(word, count)) %>%
  ggplot(aes(word, count, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "2020-05-28",
       x = NULL) +
  theme(text=element_text(size=14, family = "Heiti TC Light"))+
  coord_flip()
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector

Using the sentiment spikes to revisit events: the second negative-word peak falls on '2020-05-29'.

m %>% filter(artDate == as.Date('2020-05-29')) %>% distinct(artUrl, .keep_all = TRUE)
## # A tibble: 12 x 6
##    artTitle             artDate    artTime artUrl               word  count
##    <chr>                <date>     <time>  <chr>                <chr> <dbl>
##  1 [新聞]馬英九再籲改發消費券 蘇貞昌酸… 2020-05-29 01:43   https://www.ptt.cc/… 消費      8
##  2 [新聞]改發消費券?蘇貞昌:不重蹈馬英… 2020-05-29 02:01   https://www.ptt.cc/… 蘇貞昌…     4
##  3 Re:[新聞]馬英九再籲改發消費券 蘇… 2020-05-29 02:04   https://www.ptt.cc/… 消費      5
##  4 Re:[新聞]馬英九再籲改發消費券 蘇… 2020-05-29 02:32   https://www.ptt.cc/… 振興      5
##  5 Re:[新聞]馬英九再籲改發消費券 蘇… 2020-05-29 03:06   https://www.ptt.cc/… 振興      5
##  6 [新聞]蘇貞昌批馬政府消費券:對經濟成… 2020-05-29 03:21   https://www.ptt.cc/… 消費      5
##  7 [新聞]蘇揆嗆不重蹈消費券覆轍馬辦:放… 2020-05-29 04:00   https://www.ptt.cc/… 消費      8
##  8 Re:[新聞]馬英九再籲改發消費券 蘇… 2020-05-29 04:21   https://www.ptt.cc/… 政策      3
##  9 Re:[新聞]馬英九再籲改發消費券 蘇… 2020-05-29 09:09   https://www.ptt.cc/… 使用量…     5
## 10 [新聞]馬英九槓蘇貞昌不發消費券 「馬… 2020-05-29 09:40   https://www.ptt.cc/… 消費     24
## 11 Re:[新聞]馬英九槓蘇貞昌不發消費券… 2020-05-29 10:08   https://www.ptt.cc/… 大家      2
## 12 Re:[新聞]馬英九槓蘇貞昌不發消費券… 2020-05-29 11:58   https://www.ptt.cc/… 消費      4
 # tokens_count_by_date %>%
 #   filter(artDate == as.Date('2020-05-29')) %>%
 #   select(word,sum) %>%
 #   group_by(word) %>%
 #   summarise(count = sum(sum))  %>%
 #   filter(count>20) %>%   # drop words that appear too rarely
 #   wordcloud2()

n_tokens %>% 
  filter(artDate == as.Date('2020-05-29')) %>%
  inner_join(LIWC) %>% 
  filter(sentiment == "negative") %>% 
  group_by(artUrl,sentiment) %>% 
  summarise(
    artTitle = artTitle[1],
    count = n()
  ) %>% 
  arrange(desc(count))
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
n_tokens %>%
  filter(artDate == as.Date('2020-05-29')) %>% 
  inner_join(LIWC) %>%
  group_by(word,sentiment) %>%
  summarise(
    count = n()
  ) %>% data.frame() %>% 
  top_n(30,wt = count) %>%
  ungroup() %>% 
  mutate(word = reorder(word, count)) %>%
  ggplot(aes(word, count, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "2020-05-28t",
       x = NULL) +
  theme(text=element_text(size=14, family = "Heiti TC Light"))+
  coord_flip()
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector

Looking back at the sentiment plot: why were people still discussing consumption vouchers before the government floated the idea in 2020?

sentiment_count %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%y/%m/%d")) 

sentiment_count %>%
  filter(artDate < "2020-01-01") %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%y/%m/%d")) 

Word frequencies in posts before 2020

pc <- m %>%
  filter(artDate < "2020-01-01") %>%
  select(artDate, word, count)
plot_merge <- pc %>% 
  filter(artDate < "2020-01-01") %>%
  filter(word !="消費卷") %>%
   filter(word !="消費券") %>%
   filter(word !="振興卷") %>%
   filter(word !="振興券") %>%
   filter(word !="武漢肺炎") %>%
   filter(word !="武漢病毒") %>%
   filter(word !="酷碰券") %>%
   filter(word !="新冠肺炎")%>%
  #group_by(type) %>% 
  top_n(20, count) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, count)) %>%
  ggplot(aes(x=word, y=count)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y="詞頻") +
  theme(text=element_text(size=14, family = "Heiti TC Light"))+
  #facet_wrap(~type, ncol = 1, scales="free") + 
  coord_flip()

plot_merge

Taipei City councilors, Wang Hung-wei (王鴻薇) among them, jointly petitioned for consumption vouchers to stimulate the economy.

# total number of tokens per post
m_c_total <- m_comment %>%
  group_by(artUrl) %>%
  summarize(total = n())
# join the totals back onto m_comment
m_c_tfidf <- m_comment %>%
  select(word, artUrl, count) %>%
  left_join(m_c_total)
## Joining, by = "artUrl"
# compute tf-idf
m_c_tfidf <- m_c_tfidf %>%
  bind_tf_idf(word, count, total) %>%
  distinct()
## Warning in bind_tf_idf.data.frame(., word, count, total): A value for tf_idf is negative:
## Input should have exactly one row per document-term combination.
# the top-10 tf-idf words per article
t <- m_c_tfidf %>%  group_by(artUrl) %>%
  top_n(10) %>%
  arrange(desc(artUrl)) %>%
  ungroup() %>%
  count(word, sort=TRUE) 
## Selecting by tf_idf
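The warning above comes from bind_tf_idf() receiving several rows per document-term pair (and its arguments are term, document, n, in that order). A cleaner construction would count tokens per article first (a sketch; m_c_tfidf2 and t2 are our own names):

m_c_tfidf2 <- m_comment %>%
  count(artUrl, word, name = "n") %>%   # one row per document-term pair
  bind_tf_idf(word, artUrl, n)          # args: term, document, n

t2 <- m_c_tfidf2 %>%
  group_by(artUrl) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  count(word, sort = TRUE, name = "n_docs")  # words distinctive in many articles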

Co-occurrence network

m_c_cor <- m_comment %>%
  group_by(word) %>%
  filter(n() > 3) %>%
  pairwise_cor(word, artUrl, sort = T)
m_c_cor %>%
  filter(item1 %in% c("支持", "絕望", "驚訝")) %>%
  group_by(item1) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  theme(text = element_text(family = "Heiti TC Light"))+
  facet_wrap(~ item1, scales = "free") +
  coord_flip()
## Selecting by correlation

# show pairs with correlation above 0.55
#set.seed(2020)

#m_c_cor %>%
#  filter(correlation > 0.55) %>%
#  graph_from_data_frame() %>%
#  ggraph(layout = "fr") +
#  theme(text = element_text(family = "Heiti TC Light"))+
#  geom_node_point(color = "lightblue", size = 3) +
#  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
#  theme_void()

Converting the data to a DTM

m_dtm <- m %>% cast_dtm(artUrl, word, count)
m_dtm
## <<DocumentTermMatrix (documents: 240, terms: 6056)>>
## Non-/sparse entries: 15432/1438008
## Sparsity           : 99%
## Maximal term length: 9
## Weighting          : term frequency (tf)
inspect(m_dtm[1:10,1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 34/66
## Sparsity           : 66%
## Maximal term length: 3
## Weighting          : term frequency (tf)
## Sample             :
##                                                           Terms
## Docs                                                       八卦 大家 當年
##   https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html    1    1    1
##   https://www.ptt.cc/bbs/Gossiping/M.1484704029.A.D83.html    0    0    1
##   https://www.ptt.cc/bbs/Gossiping/M.1484796648.A.AC8.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484809083.A.979.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484824813.A.E19.html    0    1    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484825377.A.B6D.html    0    1    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484920512.A.E03.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1485424627.A.05E.html    0    1    0
##   https://www.ptt.cc/bbs/Gossiping/M.1489399885.A.E0F.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1490027352.A.ED0.html    1    1    1
##                                                           Terms
## Docs                                                       發現 剛剛 今天
##   https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html    1    1    1
##   https://www.ptt.cc/bbs/Gossiping/M.1484704029.A.D83.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484796648.A.AC8.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484809083.A.979.html    0    0    2
##   https://www.ptt.cc/bbs/Gossiping/M.1484824813.A.E19.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484825377.A.B6D.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484920512.A.E03.html    0    0    1
##   https://www.ptt.cc/bbs/Gossiping/M.1485424627.A.05E.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1489399885.A.E0F.html    0    0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1490027352.A.ED0.html    0    0    0
##                                                           Terms
## Docs                                                       年前 消費
##   https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html    1    1
##   https://www.ptt.cc/bbs/Gossiping/M.1484704029.A.D83.html    0    1
##   https://www.ptt.cc/bbs/Gossiping/M.1484796648.A.AC8.html    0    7
##   https://www.ptt.cc/bbs/Gossiping/M.1484809083.A.979.html    0    7
##   https://www.ptt.cc/bbs/Gossiping/M.1484824813.A.E19.html    1    2
##   https://www.ptt.cc/bbs/Gossiping/M.1484825377.A.B6D.html    0    3
##   https://www.ptt.cc/bbs/Gossiping/M.1484920512.A.E03.html    0    6
##   https://www.ptt.cc/bbs/Gossiping/M.1485424627.A.05E.html    0    1
##   https://www.ptt.cc/bbs/Gossiping/M.1489399885.A.E0F.html    0    2
##   https://www.ptt.cc/bbs/Gossiping/M.1490027352.A.ED0.html    0    2
##                                                           Terms
## Docs                                                       有沒有 怎麼
##   https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html      1    1
##   https://www.ptt.cc/bbs/Gossiping/M.1484704029.A.D83.html      0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484796648.A.AC8.html      0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484809083.A.979.html      0    1
##   https://www.ptt.cc/bbs/Gossiping/M.1484824813.A.E19.html      0    1
##   https://www.ptt.cc/bbs/Gossiping/M.1484825377.A.B6D.html      0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1484920512.A.E03.html      0    0
##   https://www.ptt.cc/bbs/Gossiping/M.1485424627.A.05E.html      0    1
##   https://www.ptt.cc/bbs/Gossiping/M.1489399885.A.E0F.html      1    0
##   https://www.ptt.cc/bbs/Gossiping/M.1490027352.A.ED0.html      1    0

Building the LDA model

lda <- LDA(m_dtm, k = 2, control = list(seed = 2020))

ϕ Matrix

Inspect the ϕ matrix (topic × term)

#topics <- tidy(lda, matrix = "beta") 
#注意,在tidy function裡面要使用"beta"來取出Phi矩陣。
#topics

Representative words for each topic

#top_terms <- topics %>%
#  group_by(topic) %>%
#  top_n(10, beta) %>%
#  ungroup() %>%
#  arrange(topic, -beta)


#top_terms %>%
#  mutate(term = reorder(term, beta)) %>%
#  ggplot(aes(term, beta, fill = factor(topic))) +
#  geom_col(show.legend = FALSE) +
#  theme(text = element_text(family = "Heiti TC Light"))+
#  facet_wrap(~ topic, scales = "free") +
#  coord_flip()

More topic counts

ldas = c()
topics = c(2,3,10,25,36)
for(topic in topics){
 start_time <- Sys.time()
 lda <- LDA(m_dtm, k = topic, control = list(seed = 2020))
 ldas =c(ldas,lda)
 print(paste(topic ,paste("topic(s) and use time is ", Sys.time() -start_time)))
 save(ldas,file = "ldas_result")
}
## [1] "2 topic(s) and use time is  0.372091054916382"
## [1] "3 topic(s) and use time is  0.872714042663574"
## [1] "10 topic(s) and use time is  5.63141989707947"
## [1] "25 topic(s) and use time is  16.5622980594635"
## [1] "36 topic(s) and use time is  31.8712620735168"

Results for each topic count

load("ldas_result")
ldas[[3]] 
## A LDA_VEM topic model with 10 topics.

Finding the best number of topics via perplexity

# topics = c(2,3,10,25,36)
# data_frame(k = topics,
#            perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
#   ggplot(aes(k, perplex)) +
#   geom_point() +
#   geom_line() +
#   labs(title = "Evaluating LDA topic models",
#        subtitle = "Optimal number of topics (smaller is better)",
#        x = "Number of topics",
#        y = "Perplexity")

Building the LDA model with 3 topics

lda <- LDA(m_dtm, k = 3, control = list(seed = 2020))

ϕ Matrix

Inspect the ϕ matrix (topic × term)

topics <- tidy(lda, matrix = "beta") # 注意,在tidy function裡面要使用"beta"來取出Phi矩陣。
topics
## # A tibble: 18,168 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 剛剛  1.16e-  3
##  2     2 剛剛  1.20e- 27
##  3     3 剛剛  1.89e- 25
##  4     1 發現  1.20e-  3
##  5     2 發現  2.30e-  3
##  6     3 發現  9.60e-  4
##  7     1 年前  9.83e-  4
##  8     2 年前  1.92e-  4
##  9     3 年前  8.11e-141
## 10     1 今天  1.71e-  3
## # … with 18,158 more rows

Representative words for each topic

top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  theme(text = element_text(family = "Heiti TC Light"))+
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

With three topics, the differences between topics become visible.

A function to create the JSON needed by LDAvis

  • This function converts a model built with the LDA function above into the input format expected by the LDAvis package.
topicmodels_json_ldavis <- function(fitted, doc_term){
    require(LDAvis)
    require(slam)
  
   ### the helper below works around NAs that appear when the number of topics is large
   ## see https://github.com/cpsievert/LDAvis/commit/c7234d71168b1e946a361bc00593bc5c4bf8e57e
  ls_LDA = function (phi)
  {
    jensenShannon <- function(x, y) {
      m <- 0.5 * (x + y)
      lhs <- ifelse(x == 0, 0, x * (log(x) - log(m + 1e-16)))
      rhs <- ifelse(y == 0, 0, y * (log(y) - log(m + 1e-16)))
      0.5 * sum(lhs) + 0.5 * sum(rhs)
    }
    dist.mat <- proxy::dist(x = phi, method = jensenShannon)
    pca.fit <- stats::cmdscale(dist.mat, k = 2)
    data.frame(x = pca.fit[, 1], y = pca.fit[, 2])
  }

    # Find required quantities
    phi <- as.matrix(posterior(fitted)$terms)
    theta <- as.matrix(posterior(fitted)$topics)
    vocab <- colnames(phi)
    term_freq <- slam::col_sums(doc_term)

    # Convert to json
    json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                            vocab = vocab,
                            doc.length = as.vector(table(doc_term$i)),
                            term.frequency = term_freq, mds.method = ls_LDA)

    return(json_lda)
}
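The function is defined but never invoked in this report; a minimal sketch of how it could be wired up, assuming the k = 3 model lda and the m_dtm built earlier (json_res is our own name):

json_res <- topicmodels_json_ldavis(fitted = lda, doc_term = m_dtm)
serVis(json_res)   # opens the interactive LDAvis view in a browser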

Follow-up LDA analysis

  • After the exploration above we understand the data better; having viewed the LDAvis output for each topic count, we settle on 3 topics for the remaining analysis.
m_lda = ldas[[2]] ## pick the k = 3 model

topics <- tidy(m_lda, matrix = "beta") # use "beta" in tidy() to extract the phi matrix
topics
## # A tibble: 18,168 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 剛剛  1.16e-  3
##  2     2 剛剛  1.20e- 27
##  3     3 剛剛  1.89e- 25
##  4     1 發現  1.20e-  3
##  5     2 發現  2.30e-  3
##  6     3 發現  9.60e-  4
##  7     1 年前  9.83e-  4
##  8     2 年前  1.92e-  4
##  9     3 年前  8.11e-141
## 10     1 今天  1.71e-  3
## # … with 18,158 more rows

Finding each topic's representative words

  • List the 10 words with the highest generation probability in each topic.
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  theme(text = element_text(family = "Heiti TC Light"))+
  geom_col(show.legend = FALSE) + 
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

  • Remove frequent words that are shared across topics.
remove_word = c("消費","發放","政府","經濟")
top_terms <- topics %>%
  filter(!term  %in% remove_word)%>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  theme(text = element_text(family = "Heiti TC Light"))+
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Naming the topics

topic_name = c("古今比較","疫情相關","旅遊相關")

Per-document topic distribution

# for every document we have a probability distribution of its contained topics
tmResult <- posterior(m_lda)
doc_pro <- tmResult$topics 
dim(doc_pro)               # nDocs(DTM) distributions over K topics
## [1] 240   3

Every article has its own topic distribution, hence 240 articles × 3 topics.
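As a quick sanity check (a sketch), each row of doc_pro should be a probability distribution over the three topics:

round(head(doc_pro, 3), 3)   # topic mixtures of the first three documents
summary(rowSums(doc_pro))    # every row should sum to 1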

cbind the per-document topic distribution

# get document topic proportions 
document_topics <- doc_pro[m$artUrl,]
document_topics_df =data.frame(document_topics)
colnames(document_topics_df) = topic_name
rownames(document_topics_df) = NULL
m_topic = cbind(m,document_topics_df)
# m_topic %>% head(10)

Inspecting articles for a specific topic

m_topic %>%
    arrange(desc(`古今比較`)) %>%head(10) 
##                                        artTitle    artDate  artTime
## 1  [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
## 2  [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
## 3  [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
## 4  [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
## 5  [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
## 6  [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
## 7  [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
## 8  [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
## 9  [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
## 10 [新聞]消費券買電腦奠基礎 高中少女擁30UP證照 2018-01-06 16:43:19
##                                                      artUrl   word count
## 1  https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html   電腦    11
## 2  https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html   競賽     6
## 3  https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html 蔡慧玲     5
## 4  https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html   全國     5
## 5  https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html   技能     5
## 6  https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html   證照     4
## 7  https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html   應用     4
## 8  https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html   希望     4
## 9  https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html   未來     4
## 10 https://www.ptt.cc/bbs/Gossiping/M.1515286161.A.057.html   完整     3
##     古今比較     疫情相關     旅遊相關
## 1  0.9997869 0.0001065682 0.0001065682
## 2  0.9997869 0.0001065682 0.0001065682
## 3  0.9997869 0.0001065682 0.0001065682
## 4  0.9997869 0.0001065682 0.0001065682
## 5  0.9997869 0.0001065682 0.0001065682
## 6  0.9997869 0.0001065682 0.0001065682
## 7  0.9997869 0.0001065682 0.0001065682
## 8  0.9997869 0.0001065682 0.0001065682
## 9  0.9997869 0.0001065682 0.0001065682
## 10 0.9997869 0.0001065682 0.0001065682

Topic trends over time

# m_topic[,c(7:16)] =sapply(m_topic[,c(7:16)] , as.numeric)
m_topic %>% 
  group_by(artDate = format(artDate,'%Y%m')) %>%
  dplyr::select(-count)%>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  melt(id.vars = "artDate")%>%
 ggplot( aes(x=artDate, y=value, fill=variable)) + 
  theme(text = element_text(family = "Heiti TC Light"))+
  geom_bar(stat = "identity") + ylab("value") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  • Keep the months with the most data

m_topic %>%
  filter( format(artDate,'%Y%m') %in% c(201802,202004,202005))%>%
  dplyr::select(-count)%>%
  group_by(artDate = format(artDate,'%Y%m')) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  melt(id.vars = "artDate")%>%
 ggplot( aes(x=artDate, y=value, fill=variable)) + 
  theme(text = element_text(family = "Heiti TC Light"))+
  geom_bar(stat = "identity") + ylab("value") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  • Topic trends over time, as proportions
m_topic %>%
  filter( format(artDate,'%Y%m') %in% c(201802,202004,202005))%>%
  dplyr::select(-count)%>%
  group_by(artDate = format(artDate,'%Y%m')) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
melt(id.vars = "artDate")%>%
  group_by(artDate)%>%
  mutate(total_value =sum(value))%>%
 ggplot( aes(x=artDate, y=value/total_value, fill=variable)) + 
  theme(text = element_text(family = "Heiti TC Light"))+
  geom_bar(stat = "identity") + ylab("proportion") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

In 2018 the discussion revolves almost entirely around travel, likely because Hualien's county magistrate asked the central government to issue consumption vouchers to revive local tourism; 2018 posts mostly ask whether vouchers genuinely help the industry and rarely compare with the past. In 2020 the discussion shifts to the pandemic and to past-versus-present comparisons.

Loading the PTT data

# article data
m <- read_csv("./data/m_articleMetaData.csv")
head(m)
## # A tibble: 6 x 10
##   artTitle artDate    artTime artUrl artPoster artCat commentNum  push
##   <chr>    <date>     <time>  <chr>  <chr>     <chr>       <dbl> <dbl>
## 1 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi…         38    22
## 2 Re:[問卦]… 2017-01-17 17:41   https… VVizZ     Gossi…          3     0
## 3 [新聞]稅收超… 2017-01-18 19:24   https… nokia6660 Gossi…        145    23
## 4 [新聞]溢收賦… 2017-01-18 22:51   https… cc9i      Gossi…         61    32
## 5 [問卦]平心而… 2017-01-19 03:14   https… ipadmini6 Gossi…         71    30
## 6 Re:[問卦]… 2017-01-19 03:23   https… tagso     Gossi…         63    10
## # … with 2 more variables: boo <dbl>, sentence <chr>
# comment (reply) data
mr <- read_csv("./data/m_articleReviews.csv")
head(mr)
## # A tibble: 6 x 10
##   artTitle artDate    artTime artUrl artPoster artCat cmtPoster cmtStatus
##   <chr>    <date>     <time>  <chr>  <chr>     <chr>  <chr>     <chr>    
## 1 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… wensandra 推       
## 2 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… js52666   →        
## 3 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… jffry6663 →        
## 4 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… js52666   →        
## 5 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… wensandra →        
## 6 [問卦]有沒有… 2017-01-17 17:37   https… kmjhome   Gossi… ah937609  →        
## # … with 2 more variables: cmtDate <dttm>, cmtContent <chr>
# keep only the needed columns
mr <- mr %>%
      select(artUrl, cmtPoster, cmtStatus, cmtContent)
head(mr)
## # A tibble: 6 x 4
##   artUrl                                 cmtPoster cmtStatus cmtContent    
##   <chr>                                  <chr>     <chr>     <chr>         
## 1 https://www.ptt.cc/bbs/Gossiping/M.14… wensandra 推        :之前嫖一次剛剛好…
## 2 https://www.ptt.cc/bbs/Gossiping/M.14… js52666   →         :一樓拿去買假屌…
## 3 https://www.ptt.cc/bbs/Gossiping/M.14… jffry6663 →         :買鞋子幹     
## 4 https://www.ptt.cc/bbs/Gossiping/M.14… js52666   →         :改五樓還拿來自肛…
## 5 https://www.ptt.cc/bbs/Gossiping/M.14… wensandra →         :?            
## 6 https://www.ptt.cc/bbs/Gossiping/M.14… ah937609  →         :買天幣

Data preview

m %>% 
  group_by(artDate) %>%
  summarise(count = n())%>%
  ggplot(aes(artDate,count))+
    geom_line(color="blue", size=1)+
    theme_classic()

Number of posters

length(unique(m$artPoster))
## [1] 219

Number of repliers

length(unique(mr$cmtPoster))
## [1] 6209

Total number of participants

allPoster <- c(m$artPoster, mr$cmtPoster)
length(unique(allPoster))
## [1] 6353
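The union (6353) is smaller than 219 + 6209 = 6428 because some accounts both posted and replied; the 75-account overlap can be verified directly (a sketch):

length(intersect(unique(m$artPoster), unique(mr$cmtPoster)))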

Compiling all participants

# collect every user that appears
# label a user "poster" if they posted at least one article
# otherwise label them "replyer"
userList <- data.frame(user=unique(allPoster)) %>%
              mutate(type=ifelse(user%in%m$artPoster, "poster", "replyer"))
head(userList)
##        user   type
## 1   kmjhome poster
## 2     VVizZ poster
## 3 nokia6660 poster
## 4      cc9i poster
## 5 ipadmini6 poster
## 6     tagso poster

Building the social network graph

Joining articles with their comments

# inner-join articles and comments on artUrl
posts_Reviews <- merge(x = m, y = mr, by = "artUrl")
head(posts_Reviews)
##                                                     artUrl
## 1 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 2 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 3 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 4 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 5 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 6 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
##                             artTitle    artDate  artTime artPoster
## 1 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 2 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 3 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 4 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 5 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 6 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
##      artCat commentNum push boo
## 1 Gossiping         38   22   2
## 2 Gossiping         38   22   2
## 3 Gossiping         38   22   2
## 4 Gossiping         38   22   2
## 5 Gossiping         38   22   2
## 6 Gossiping         38   22   2
##                                                                                                 sentence
## 1 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 2 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 3 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 4 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 5 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 6 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
##   cmtPoster cmtStatus        cmtContent
## 1 wensandra        推 :之前嫖一次剛剛好
## 2   js52666        →   :一樓拿去買假屌
## 3 jffry6663        →         :買鞋子幹
## 4   js52666        → :改五樓還拿來自肛
## 5 wensandra        →                :?
## 6  ah937609        →           :買天幣

Selecting columns

# keep three columns: cmtPoster (replier), artPoster (poster), artUrl (article link)
link <- posts_Reviews %>%
      select(cmtPoster, artPoster, artUrl)
head(link)
##   cmtPoster artPoster
## 1 wensandra   kmjhome
## 2   js52666   kmjhome
## 3 jffry6663   kmjhome
## 4   js52666   kmjhome
## 5 wensandra   kmjhome
## 6  ah937609   kmjhome
##                                                     artUrl
## 1 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 2 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 3 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 4 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 5 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 6 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html

Building the network

reviewNetwork <- graph_from_data_frame(d=link, directed=T)

Network plot

# draw the network
plot(reviewNetwork)

The result is a dense tangle from which nothing can be read. Let's try plotting less of it.

Tuning the plot parameters

# shrink the vertices and edges, and hide user names
plot(reviewNetwork, vertex.size=2, edge.arrow.size=.2,vertex.label=NA)

Still not much to interpret; let's cut down the number of articles.

Filtering the data

Pick out the articles from 2020-05-29 and their comments

link <- posts_Reviews %>%
      filter(artDate == as.Date('2020-05-29')) %>%
      select(cmtPoster, artPoster, artUrl) %>% 
      unique()
head(link)
##      cmtPoster artPoster
## 1    IDfor2010     runa2
## 2     felixden     runa2
## 3    royroy666     runa2
## 4       albert     runa2
## 5 forhorde5566     runa2
## 8    NTULioner     runa2
##                                                     artUrl
## 1 https://www.ptt.cc/bbs/Gossiping/M.1590716618.A.8EC.html
## 2 https://www.ptt.cc/bbs/Gossiping/M.1590716618.A.8EC.html
## 3 https://www.ptt.cc/bbs/Gossiping/M.1590716618.A.8EC.html
## 4 https://www.ptt.cc/bbs/Gossiping/M.1590716618.A.8EC.html
## 5 https://www.ptt.cc/bbs/Gossiping/M.1590716618.A.8EC.html
## 8 https://www.ptt.cc/bbs/Gossiping/M.1590716618.A.8EC.html

Filtering the graph's vertices (v)

# keep only users that actually appear in link:
# if userList (what the v argument of igraph's graph_from_data_frame takes)
# contains users absent from link, igraph draws them anyway and the plot loses meaning
# run the code further below to see what that looks like
filtered_user <- userList %>%
          filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
          arrange(desc(type))
head(filtered_user)
##          user    type
## 1     eric999 replyer
## 2      fcz973 replyer
## 3 ChungLi5566 replyer
## 4  carlos5978 replyer
## 5       amury replyer
## 6       hosen replyer

Without filtering users

## warning! proceed with caution if dense plots bother you
#reviewNetwork <- graph_from_data_frame(d=link, v=userList, directed=T)
#plot(reviewNetwork, vertex.size=3, edge.arrow.size=0.3,vertex.label=NA)

After filtering users

set.seed(487)
# build the graph; the arrowheads were distracting,
# so we drop edge direction to remove unnecessary visual noise
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
plot(reviewNetwork, vertex.size=3, edge.arrow.size=0.3,vertex.label=NA)

Some relationships among the points (people) are now visible, but the bare graph alone doesn't let us analyze what's going on.
Below we therefore fold information from the dataset into the plot.

Enriching the plot (1)

set.seed(487)
# color vertices by role: gold for users who posted, light blue for comment-only users
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
plot(reviewNetwork, vertex.size=3, edge.arrow.size=0.3,vertex.label=NA)

Enriching the plot (2)

set.seed(487)
# show labels only for selected users so the plot is not buried in text:
# label accounts with more than 5 connections
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)

V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
plot(reviewNetwork, vertex.size=3, edge.arrow.size=0.2,
     vertex.label=ifelse(degree(reviewNetwork) > 5, V(reviewNetwork)$label, NA),  vertex.label.font=2)

We can see the basic user relationships, but we'd like to visualize richer information,
for example which kinds of articles a user often joins, or whether a user is popular within the network.
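Before bringing in article topics, one quick way to visualize popularity is to scale each vertex by its degree (a sketch; the scaling constants are arbitrary):

# bigger vertex = more connections
V(reviewNetwork)$size <- 2 + sqrt(degree(reviewNetwork))
plot(reviewNetwork, edge.arrow.size = 0.2,
     vertex.label = ifelse(degree(reviewNetwork) > 5, V(reviewNetwork)$label, NA),
     vertex.label.font = 2)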

Preprocessing

# split articles into sentences
m_meta <- m %>%
              mutate(sentence=gsub("[\n]{2,}", "。", sentence))
# split on full- or half-width exclamation marks, question marks, semicolons, and the full-width period
m_sentences <- strsplit(m_meta$sentence,"[。!;?!?;]")
# pair each sentence with its article URL and collect them in a data frame
m_sentences <- data.frame(
                        artUrl = rep(m_meta$artUrl, sapply(m_sentences, length)), 
                        sentence = unlist(m_sentences)
                      ) %>%
                      filter(!str_detect(sentence, regex("^(\t|\n| )*$")))
m_sentences$sentence <- as.character(m_sentences$sentence)
head(m_sentences)
##                                                     artUrl
## 1 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 2 https://www.ptt.cc/bbs/Gossiping/M.1484704029.A.D83.html
## 3 https://www.ptt.cc/bbs/Gossiping/M.1484704029.A.D83.html
## 4 https://www.ptt.cc/bbs/Gossiping/M.1484704029.A.D83.html
## 5 https://www.ptt.cc/bbs/Gossiping/M.1484796648.A.AC8.html
## 6 https://www.ptt.cc/bbs/Gossiping/M.1484796648.A.AC8.html
##                                                                                             sentence
## 1 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ
## 2    這我知道\n當年小折瘋\nKHS有出一台P-3600\n就是因為消費券的關係\n還挺不綽的\n一台也很乾脆訂價7200
## 3                                                             然後我就多花了3600\n最後退燒了只賣5500
## 4                                                                                                 幹
## 5                                                                                   1.媒體來源:\nUdn
## 6                                                 2.完整新聞標題:\n稅收超收4000億 藍營議員籲發消費券
## tokenize the articles
# load the voucher lexicon
m_lexicon <- scan(file = "ticket_lexicon.txt", what=character(),sep='\n', 
                   encoding='utf-8',fileEncoding='utf-8')
# load stop words
stop_words <- scan(file = "stop_words.txt", what=character(),sep='\n', 
                   encoding='utf-8',fileEncoding='utf-8')
# initialize a tokenizer with default settings
jieba_tokenizer = worker()
# re-tokenize using the voucher lexicon
new_user_word(jieba_tokenizer, c(m_lexicon))
## [1] TRUE
# tokenize function
chi_tokenizer <- function(t) {
  lapply(t, function(x) {
    if(nchar(x)>1){
      tokens <- segment(x, jieba_tokenizer)
      tokens <- tokens[!tokens %in% stop_words]
      # drop single-character tokens
      tokens <- tokens[nchar(tokens)>1]
      return(tokens)
    }
  })
}
tokens <- m_sentences %>%
  unnest_tokens(word, sentence, token=chi_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>%
  count(artUrl, word) %>%
  rename(count=n)
head(tokens)
## # A tibble: 6 x 3
##   artUrl                                                   word  count
##   <fct>                                                    <chr> <int>
## 1 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html 八卦      1
## 2 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html 八年      1
## 3 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html 當年      1
## 4 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html 發現      1
## 5 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html 父母      1
## 6 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html 剛剛      1
## clean the tokenization results
# keep words that appear more than 3 times overall
reserved_word <- tokens %>% 
  group_by(word) %>% 
  count() %>% 
  filter(n > 3) %>% 
  unlist()

m_removed <- tokens %>% 
  filter(word %in% reserved_word)

m_dtm <- m_removed %>% cast_dtm(artUrl, word, count)
m_dtm
## <<DocumentTermMatrix (documents: 240, terms: 815)>>
## Non-/sparse entries: 7970/187630
## Sparsity           : 96%
## Maximal term length: 6
## Weighting          : term frequency (tf)

LDA topic analysis

# fit LDA with 4 topics
m_lda <- LDA(m_dtm, k = 4, control = list(seed = 1234))
# the most common words in each topic
tidy(m_lda, matrix = "beta") %>%
  filter(! term %in% c("振興券")) %>% 
  filter(! term %in% c("消費券")) %>% 
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = as.factor(topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  theme(text = element_text(family = "Heiti TC Light")) +
  coord_flip() +
  scale_x_reordered()

From the top terms we can label:
topic 1 = "消費券政策" (consumption-voucher policy)
topic 2 = "政府政策" (government policy)
topic 3 = "觀光補助相關" (tourism subsidies)
topic 4 = still needs further splitting
Below we compare the first and second topics.

# classify each article by its dominant topic
m_topics <- tidy(m_lda, matrix="gamma") %>% # use "gamma" in tidy() to get the theta matrix
                  group_by(document) %>%
                  top_n(1, wt=gamma)
head(m_topics)
## # A tibble: 6 x 3
## # Groups:   document [6]
##   document                                                 topic gamma
##   <chr>                                                    <int> <dbl>
## 1 https://www.ptt.cc/bbs/Gossiping/M.1586234085.A.86F.html     1 0.985
## 2 https://www.ptt.cc/bbs/Gossiping/M.1586240778.A.B43.html     1 0.548
## 3 https://www.ptt.cc/bbs/Gossiping/M.1586311835.A.55F.html     1 0.669
## 4 https://www.ptt.cc/bbs/Gossiping/M.1586315265.A.8A0.html     1 0.513
## 5 https://www.ptt.cc/bbs/Gossiping/M.1586925646.A.E6B.html     1 0.984
## 6 https://www.ptt.cc/bbs/Gossiping/M.1587101544.A.F1A.html     1 0.999

Visualizing the LDA topics

# join article info with the topics
posts_Reviews <- merge(x = posts_Reviews, y = m_topics, by.x = "artUrl", by.y="document")
head(posts_Reviews)
##                                                     artUrl
## 1 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 2 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 3 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 4 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 5 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
## 6 https://www.ptt.cc/bbs/Gossiping/M.1484703813.A.B6F.html
##                             artTitle    artDate  artTime artPoster
## 1 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 2 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 3 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 4 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 5 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
## 6 [問卦]有沒有當年消費券怎麼用的八卦 2017-01-17 17:37:27   kmjhome
##      artCat commentNum push boo
## 1 Gossiping         38   22   2
## 2 Gossiping         38   22   2
## 3 Gossiping         38   22   2
## 4 Gossiping         38   22   2
## 5 Gossiping         38   22   2
## 6 Gossiping         38   22   2
##                                                                                                 sentence
## 1 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 2 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 3 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 4 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 5 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
## 6 剛剛才發現\n八年前的今天是發消費券\n有沒有大家當年都怎麼用的八卦\n想當年那3600還是進了父母的口袋QQ\n\n
##   cmtPoster cmtStatus        cmtContent topic     gamma
## 1 wensandra        推 :之前嫖一次剛剛好     4 0.9771259
## 2   js52666        →   :一樓拿去買假屌     4 0.9771259
## 3 jffry6663        →         :買鞋子幹     4 0.9771259
## 4   js52666        → :改五樓還拿來自肛     4 0.9771259
## 5 wensandra        →                :?     4 0.9771259
## 6  ah937609        →           :買天幣     4 0.9771259
# keep articles dated after 2020-05-01,
# keep replier-article pairs with more than two comments,
# keep articles classified as topic 1 (消費券政策) or topic 2 (政府政策),
# and select only: cmtPoster (replier), artPoster (poster), artUrl (link), topic
link <- posts_Reviews %>%
      filter(artDate > as.Date('2020-05-01')) %>%
      group_by(cmtPoster, artUrl) %>% 
      filter(n()>2) %>% 
      ungroup() %>% 
      filter(topic == 1 | topic == 2) %>% 
      select(cmtPoster, artPoster, artUrl, topic) %>% 
      unique()
head(link)
## # A tibble: 6 x 4
##   cmtPoster    artPoster artUrl                                       topic
##   <chr>        <chr>     <chr>                                        <int>
## 1 ArimuraChika ReposJob  https://www.ptt.cc/bbs/Gossiping/M.15889144…     2
## 2 ssuchi1222   ReposJob  https://www.ptt.cc/bbs/Gossiping/M.15889144…     2
## 3 brockqq      ReposJob  https://www.ptt.cc/bbs/Gossiping/M.15889144…     2
## 4 whitezealman ova132132 https://www.ptt.cc/bbs/Gossiping/M.15890013…     1
## 5 zalora       zalora    https://www.ptt.cc/bbs/Gossiping/M.15890823…     2
## 6 bluewinds    fur       https://www.ptt.cc/bbs/Gossiping/M.15901224…     2
# keep only users appearing in link
filtered_user <- userList %>%
          filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
          arrange(desc(type))
head(filtered_user)
##          user    type
## 1     eric999 replyer
## 2 Tchachavsky replyer
## 3    stvn2567 replyer
## 4   NTULioner replyer
## 5     ru04hj4 replyer
## 6   boogieman replyer

Which kinds of articles users often engage with

# build the network
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)

# color vertices by user role
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")

# color edges by the topic of the article the comment belongs to
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$topic == "1", "lightgreen", "palevioletred")

# draw the social network graph
set.seed(5432)
plot(reviewNetwork, vertex.size=3, edge.width=3, vertex.label.dist=1,
     vertex.label=ifelse(degree(reviewNetwork) > 5, V(reviewNetwork)$label, NA),vertex.label.font=2)

# add legends
par(family = "Heiti TC Light")
legend("bottomright", c("發文者","回文者"), pch=21, 
  col="#777777", pt.bg=c("gold","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("消費券政策","觀光補助相關"), 
       col=c("lightgreen","palevioletred"), lty=1, cex=1, )

Account "wyin" mostly posted articles on the consumption-voucher policy; on inspection they are all news posts, and they drew heavy responses.
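Accounts like "wyin" can also be surfaced without reading the plot by ranking poster vertices by degree (a sketch; poster_degree is our own name):

poster_degree <- degree(reviewNetwork)[V(reviewNetwork)$type == "poster"]
head(sort(poster_degree, decreasing = TRUE), 10)   # the most-replied-to posters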

Whether users are popular

# PTT comments come in three kinds: push (推), boo (噓), and arrow (→)
# we only care about push/boo, so drop the arrows
link <- posts_Reviews %>%
      filter(cmtStatus!="→") %>%
      group_by(cmtPoster, artUrl) %>% 
      filter(n()>2) %>% 
      ungroup() %>% 
      select(cmtPoster, artPoster, artUrl, cmtStatus) %>% 
      unique()

# the plotting below mirrors the earlier steps, so we won't repeat the details

# keep only users appearing in link
filtered_user <- userList %>%
          filter(user%in%link$cmtPoster | user%in%link$artPoster) %>%
          arrange(desc(type))

# build the network
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)

# color vertices by user role
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")


# color edges by comment status: green for push (推), red for boo (噓)
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$cmtStatus == "推", "lightgreen", "palevioletred")

# draw the social network graph
set.seed(5432)
plot(reviewNetwork, vertex.size=2, edge.width=3, vertex.label.dist=1,
     vertex.label=ifelse(degree(reviewNetwork) > 3, V(reviewNetwork)$label, NA),vertex.label.font=2)

# add legends
par(family = "Heiti TC Light")
legend("bottomright", c("發文者","回文者"), pch=21,
  col="#777777", pt.bg=c("gold","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("推","噓"), 
       col=c("lightgreen","palevioletred"), lty=1, cex=1)

Pushes and boos are roughly balanced, yet the sentiment analysis showed mostly negative comments about the vouchers, so push/boo counts alone cannot reveal a commenter's stance on the policy.
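The impression that pushes and boos are roughly balanced can be checked with a quick tally (a sketch):

posts_Reviews %>%
  filter(cmtStatus %in% c("推", "噓")) %>%
  count(cmtStatus)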

Conclusion 1: Since the 2009 consumption vouchers, the policy has kept resurfacing as a benchmark ("instead of policy X, just hand out consumption vouchers"), which is why discussion persisted even before this year's stimulus vouchers were announced.

Conclusion 2: Although netizens keep measuring other policies against the consumption vouchers, the data show a mostly negative attitude toward both the consumption vouchers and the stimulus vouchers.

Conclusion 3: The LDA analysis shows that, with the government trying to stimulate consumption because of the pandemic, the discussion is dominated by pandemic-related vocabulary and by comparisons with the earlier consumption vouchers.