Work Requirements

I. Tasks and required skills
1. Word segmentation (segment analysis):

Split the text of each Facebook post into "words" (the smallest meaningful language units that can be used on their own).

1. Familiarity with segmentation models, including (but not limited to): maximum probability (MPSegment), hidden Markov model (HMMSegment), mixed model (MixSegment), query/index model (QuerySegment)
2. Familiarity with text-mining packages, including (but not limited to): tm, tidy, jiebaR

2. Keyword frequency counts: using the keyword list provided by this study, count how many times each keyword appears in each post.
→ Example:
Given the keyword 「我們」, the returned file should add a column recording how many times 「我們」 appears in each post.
For instance, 「我們」 appears 0, 0, and 2 times in the text of the first, second, and third posts, respectively.

3. Keyword presence: using the keyword list provided by this study, flag whether each keyword appears in each post. → Example: given the keyword 「我們」, the returned file should add a column flagging whether 「我們」 appears in each post.
✓ For instance, the first and second posts do not contain 「我們」 (flagged 0), while the third post does (flagged 1).

4. Sentiment analysis: using the sentiment lexicon provided by this study, weight the positive and negative sentiment words appearing in each post and compute a sentiment score for each post.
✓ Must be able to remove stop words and segment text
✓ Must be able to apply the terms of a specified dictionary
✓ Must be able to weight dictionary terms by positive/negative polarity

5. Sum category keyword occurrences by candidate: for each candidate, sum the occurrences of all keywords belonging to each category.
→ Example: the category 「1-01提及自己名字」 contains the three keywords 「柯文哲」, 「文哲」, and 「阿北」. For each of the 14 candidates (柯文哲, 姚文智, etc.), the returned file should add, after each category, a column for "total keyword occurrences" and a column for "number of distinct keywords that appear". For instance, if all posts on 柯文哲's fan page contain 「柯文哲」 0 times, 「柯P」 4 times, and 「阿北」 1 time, "total keyword occurrences" is 5 (= 0 + 4 + 1) and "number of distinct keywords" is 2 (two different keywords appear).
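A minimal dplyr sketch of this aggregation. The table and column names are placeholders rather than objects from the script below: post_tokens is assumed to be a long table with one row per candidate, post, and keyword (columns candidate, url, word, count), and category_keywords is assumed to map each keyword to its category (columns category, word).

# hypothetical sketch: per candidate and category, sum keyword occurrences
# and count how many distinct keywords appear
category_summary <- post_tokens %>%
  inner_join(category_keywords, by = "word") %>%   # keep only listed keywords
  group_by(candidate, category) %>%
  summarise(
    keyword_total   = sum(count),       # "total keyword occurrences"
    keyword_variety = n_distinct(word)  # "number of distinct keywords"
  ) %>%
  ungroup()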

Loading Packages

library(xlsx)
library(readr)
library(dplyr)
library(jiebaR)
library(tidyr)
library(tidytext)
library(topicmodels)
library(stringr)
library(ggplot2)
library(wordcloud2)
library(data.table)
library(reshape2)
library(scales)
library(tm)
library(tmcn)
library(lubridate)

Loading Datasets

# July data
july_data <- read.xlsx("data/07.xlsx", sheetIndex = 1, encoding = "UTF-8")
# August data
# aug_data <- read.xlsx("data/08.xlsx", sheetIndex = 1, encoding = "UTF-8")

Data cleansing

Keep only the politicians' own posts, and select the columns:
- url
- content
- topicBy

colnames(july_data)[1:5] <- c("note", "dpm", "day", "mid", "id")
july <- july_data[is.na(july_data$note),] %>% 
  select(url, content, topicBy)

First problem encountered:
emoji show up as garbled characters and are not a target of the analysis, so emoji need to be filtered out.
Reference: the Unicode ranges of all emoji.

After removing emoji, some content still contains encoding errors that were already present in the source file (similar garbled characters); these are excluded as well.

july_sentences <-  july %>% 
  # remove emoji
  mutate(sentence = gsub("(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])", "", content)) %>% 
  # remove non UTF-8
  mutate(sentence = gsub("[^[:alnum:][:blank:]?&/\\-]", "", sentence))
# july_sentences <- strsplit(july_sentences$sentence, "[。!;?!?;]")
# 
# july_sentences <- data.frame(
#   url = rep(july$url, sapply(july_sentences, length)),
#   topicBy = rep(july$topicBy, sapply(july_sentences, length)),
#   sentence = unlist(july_sentences)
# )

Filter out sentences that contain only tabs, newlines, or spaces, and convert the sentence column to character format.

july_sentences <- july_sentences %>% 
  filter(!str_detect(sentence, regex("^(\t|\n| )*$")))
july_sentences$sentence <- as.character(july_sentences$sentence)

Loading dictionary

Load the first keyword category: candidates mentioning their own name.

their_names <- scan(file = "dict/their_names.txt", 
                    what = character(),
                    sep = "\n",
                    encoding = "utf-8",
                    fileEncoding = "utf-8")
head(their_names)
## [1] "我"   "我們" "自己" "小弟" "小妹" "偶"

Positive, neutral, and negative keywords for the main politicians (including nicknames)

This keyword dictionary was preprocessed in Python beforehand; all candidate nicknames are used as keywords, 323 terms in total.

candidate_name <- scan(file = "dict/candidate_name_new.txt",
                       what = character(),
                       sep = "\n",
                       encoding = "utf-8",
                       fileEncoding = "utf-8")
head(candidate_name)
## [1] "丁爺爺" "丁能兒" "叮手中" "輸不起" "subuki" "老狐狸"

Cleaned-up custom dictionary

「思博新政」 has been added as a term.

customized_dict <- scan(file = "dict/customized.txt",
                       what = character(),
                       sep = "\n",
                       encoding = "utf-8",
                       fileEncoding = "utf-8")

Tokenization

As a first pass, segment with jieba and add the dictionary above as user words.

jieba_tokenizer <- worker()
new_user_word(jieba_tokenizer, c(customized_dict))
## [1] TRUE

Chinese tokenization function

chi_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    # optionally drop single-character tokens:
    # tokens <- tokens[nchar(tokens) > 1]
    return(tokens)
  })
}

Tokenize each sentence, filter out tokens containing English letters or digits, and count term frequencies per post (by url).

tokens <- july_sentences %>% 
  unnest_tokens(word, sentence, token = chi_tokenizer) %>% 
  filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>%
  count(url, word) %>%
  rename(count = n)

Convert the loaded keyword lists into data frames and combine them into a single dictionary.

their_names_df <- data.frame(word=their_names)
candidate_name_df <- data.frame(word=candidate_name)
all_keys_df <- rbind(their_names_df, candidate_name_df)

Keep only the keywords this study needs, then pivot the values into one column per keyword.

dcast_tokens <- tokens %>% 
  inner_join(all_keys_df) %>%
  dcast(url ~ word, value.var = "count", fill = 0)
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Aggregation function missing: defaulting to length
non_cols <- all_keys_df %>% 
  anti_join(tokens) %>% 
  dcast(. ~ word) %>% 
  select(-1) 
## Joining, by = "word"
## Warning: Column `word` joining factor and character vector, coercing into
## character vector
## Using word as value column: use value.var to override.
## Aggregation function missing: defaulting to length
# set all of the never-occurring keyword columns to 0
non_cols[1:nrow(non_cols),1:ncol(non_cols)] <- 0
# bind these columns back onto the dcast result
dcast_tokens <- cbind(dcast_tokens, non_cols)
output_count <- july %>% 
  left_join(dcast_tokens, by=c("url"))
output_count[is.na(output_count)] <- 0

75 posts contain none of the keywords; their NAs are filled with 0.

Keyword presence

output_exist <- tokens %>% 
  mutate(exists = ifelse(word %in% all_keys_df$word, 1, 0)) %>% 
  filter(exists == 1) %>% 
  dcast(url ~ word, value.var = "exists", fill = 0)
temp <- july %>% 
  left_join(output_exist) %>% 
  replace(is.na(.), 0)
## Joining, by = "url"

74 posts contain none of these keywords, so the left join produces NA for them; those NAs are filled with 0.

Export results as CSV

# write.csv(output_count, file="output/output_count_07.csv", fileEncoding = "UTF-8")
write.csv(output_exist, file="output/output_exist_07.csv", fileEncoding = "UTF-8")

Compute overall term frequencies

url_person <- july %>% select(url, topicBy)
tokens <- tokens %>% left_join(url_person) %>% filter(!is.na(word))
## Joining, by = "url"
tokens_count <- tokens %>% 
#  filter(nchar(.$word) > 1) %>%
  group_by(word, topicBy) %>% 
  summarise(sum = n()) %>% 
  filter(sum > 10) %>% 
  arrange(topicBy, desc(sum))

Keyword 「我」 compared across candidates

tokens_count %>% 
  filter(word=="我") %>% 
  ggplot(aes(word, sum, fill=topicBy)) +
  geom_col() +
  facet_grid(topicBy~.) +
  xlab(NULL) +
  coord_flip()

Keyword 「我們」 compared across candidates

tokens_count %>% 
  filter(word=="我們") %>% 
  ggplot(aes(word, sum, fill=topicBy)) +
  geom_col() +
  facet_grid(topicBy~.) +
  xlab(NULL) +
  coord_flip()

Mapping nicknames to candidate names

import pandas as pd
import numpy as np

keywords = pd.read_excel("/Users/yuzhe/Downloads/資工背景工讀生_關鍵字整理列表(初步建構中)__20191111.xlsx",
                         sheet_name="1-02、1-04、1-06 主要政治人物")
keywords = keywords[['子類目', '關鍵字', '(此部分為綽號須注意)          負面關鍵字',
                     '(此部分為綽號須注意)             中性 or 正面關鍵字']]
keywords.columns = ['name', 'first_name', 'negative_nick', 'positive_nick']

# join the name and both nickname columns into one " or "-separated string
keywords['keyword'] = (keywords['first_name'].astype(str) + ' or ' +
                       keywords['negative_nick'].astype(str) + ' or ' +
                       keywords['positive_nick'].astype(str))

# split into one row per (candidate name, nickname) pair
name_nick = keywords[['name', 'keyword']].copy()
name_nick['keyword'] = name_nick['keyword'].str.split(' or ')
name_nick = name_nick.explode('keyword')

# drop the 'nan' placeholders left by empty cells
name_nick.replace('nan', np.NaN, inplace=True)
name_nick.dropna(how='any', inplace=True)
name_nick.to_csv('cadidate_nickname.csv', index=False)

Load the candidate-nickname mapping produced by the Python script above.

candidate_nickname <- read_csv("dict/cadidate_nickname.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   keyword = col_character()
## )
colnames(candidate_nickname) <- c("name", "word")

After replacing nicknames with each candidate's real name

true_name <- tokens %>% 
  inner_join(candidate_nickname) %>% 
  mutate(word = name, name=NULL) %>% 
  group_by(url, word) %>% 
  summarise(count=sum(count)) %>% 
  dcast(url ~ word, value.var = "count", fill = 0)
## Joining, by = "word"
# temp <- tokens %>% 
#   mutate(word = ifelse(word %in% c("守中", "丁爺爺", "丁能兒", "叮手中", "輸不起", "subuki"), "丁守中", word)) %>% 
#   mutate(word = ifelse(word %in% c("通靈王", "國民阿舅", "上海女婿", "五億探長", "地獄刑事", "文大包租公"), "侯友宜", word)) 
#   

Sentiment Analysis

Using the sentiment lexicon provided by this study, weight the positive and negative sentiment words appearing in each post and compute a sentiment score for each post.
✓ Must be able to remove stop words and segment text
✓ Must be able to apply the terms of a specified dictionary
✓ Must be able to weight dictionary terms by positive/negative polarity

Load the sentiment dictionary and stop words

Because many of the stop words in the tmcn package are in simplified characters, the traditional-Chinese stop-word list provided by tomlinNTUB is used instead; punctuation marks are removed so that only Chinese terms remain (a possible way to do that filtering is sketched after the loading code below).

# NTUSD (NTU sentiment dictionary)
data("NTUSD")
# stop words
stopwords <- scan(file = "dict/stopwords_tc.txt",
                       what = character(),
                       sep = "\n",
                       encoding = "utf-8",
                       fileEncoding = "utf-8")
stopwords_df <- data.frame(word=stopwords)
head(stopwords_df, 10)
##    word
## 1    啊
## 2    阿
## 3    哎
## 4  哎呀
## 5  哎喲
## 6    唉
## 7    我
## 8  我們
## 9    按
## 10 按照
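A sketch of the punctuation filtering mentioned above (an assumption about how it could be done in R with stringr; the stop-word file may already have been cleaned outside R):

# keep only entries made up entirely of Chinese (Han) characters
stopwords <- stopwords[str_detect(stopwords, "^[\\p{Han}]+$")]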

Reshape the NTUSD dictionary

positive <- data.frame(word = NTUSD$positive_cht, sentiment="positive") 
negative <- data.frame(word = NTUSD$negative_cht, sentiment="negative")
NTUSD_ch <- rbind(positive, negative)
head(NTUSD_ch)
##         word sentiment
## 1   一帆風順  positive
## 2 一帆風順的  positive
## 3       一流  positive
## 4       一致  positive
## 5     一致的  positive
## 6     了不起  positive

LIWC dictionary: still looking for a larger LIWC lexicon; this one was provided by the course TA and is small.
The original plan was to compare dictionaries, but NTUSD has more than 10,000 terms while this one has only about 1,000, so the comparison is skipped for now.

p <- read_file("dict/liwc/positive.txt")
n <- read_file("dict/liwc/negative.txt")
positive <- strsplit(p, ",")[[1]]
negative <- strsplit(n, ",")[[1]]
positive <- data.frame(word = positive, sentiment = "positive", stringsAsFactors = FALSE)
negative <- data.frame(word = negative, sentiment = "negative", stringsAsFactors = FALSE)
LIWC_ch <- rbind(positive, negative)

Most frequent words in July overall, with a word cloud

word_count <- tokens %>%
  # remove stop words
  anti_join(stopwords_df) %>% 
  select(word, count) %>% 
  group_by(word) %>% 
  summarise(count = sum(count))  %>%
  filter(count>3)
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
# try out a word cloud
wordcloud2(word_count)

Most frequent positive and negative sentiment words

word_count %>% 
  inner_join(NTUSD_ch) %>%
  group_by(sentiment) %>%
  top_n(15, wt = count) %>%
  ungroup() %>% 
  mutate(word = reorder(word, count)) %>%
  ggplot(aes(word, count, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector

Per-post sentiment scores

Positive NTUSD terms are weighted +1 and negative terms -1; the weights are summed for each post.

sentiment_weight <- tokens %>% 
  inner_join(NTUSD_ch) %>% 
  mutate(sentiment=ifelse(sentiment %in% "positive", 1, -1)) %>% 
  group_by(url) %>% 
  summarise(swt = sum(sentiment))
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
head(sentiment_weight)
## # A tibble: 6 x 2
##   url                                                              swt
##   <fct>                                                          <dbl>
## 1 http://www.facebook.com/109391162488374/posts/1778415255585948     3
## 2 http://www.facebook.com/109391162488374/posts/1779467668814040    -1
## 3 http://www.facebook.com/109391162488374/posts/1781234038637403     3
## 4 http://www.facebook.com/109391162488374/posts/1781941188566688     5
## 5 http://www.facebook.com/109391162488374/posts/1785964631497677    -3
## 6 http://www.facebook.com/109391162488374/posts/1787733981320742     3

Average post sentiment score per candidate

avg_candidate_sentiment <- july %>% 
  left_join(sentiment_weight) %>% 
  group_by(topicBy) %>% 
  summarise(mean_wt = mean(swt, na.rm = T))
## Joining, by = "url"
head(avg_candidate_sentiment)
## # A tibble: 6 x 2
##   topicBy       mean_wt
##   <fct>           <dbl>
## 1 丁守中粉絲團    -1.02
## 2 侯友宜           3.46
## 3 姚文智翻台北     1.48
## 4 林佳龍           5.34
## 5 林義豐MarkLin    1.58
## 6 柯文哲           3.48

Visualization

avg_candidate_sentiment %>% 
  arrange(desc(mean_wt)) %>% 
  ggplot(aes(x=topicBy, y=mean_wt, fill=topicBy)) + 
  geom_col(show.legend = F) +
  labs(x=NULL) +
  theme(axis.text.x = element_text(size=10, angle=30))

Sentiment changes over time

senti_changes_date <- july_data %>% 
  select(url, date, topicBy) %>%
  mutate(date = format(date, "%Y-%m-%d")) %>% 
  inner_join(tokens) %>% 
  inner_join(NTUSD_ch) %>% 
  group_by(topicBy, date, sentiment) %>% 
  summarise(count=sum(count))
## Joining, by = c("url", "topicBy")Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
all_dates <- expand.grid(seq(as.Date(min(july_data$date)), as.Date(max(july_data$date)), by="day"), c("positive", "negative"))
colnames(all_dates) <- c("date", "sentiment")

senti_changes_date <- all_dates %>% 
  merge(senti_changes_date, by.x=c('date', "sentiment"), by.y=c('date', "sentiment"),
        all.x=T, all.y=T) %>% 
  mutate(count = replace_na(count, 0))

senti_changes_date %>%
  ggplot()+
  geom_line(aes(x=date, y=count, colour=topicBy)) +
  facet_wrap(~ sentiment, nrow=2, scales="free_y") +
  scale_x_date(labels = date_format("%m/%d")) 

## 5-1 Issue vocabulary

issue_keyword <- read.xlsx("dict/keyword_lists_20191210.xlsx", sheetIndex = 7, encoding = "UTF-8")
issue_keyword <- issue_keyword %>% 
  select(子類目, 關鍵字) %>% 
  rename(category=子類目, word=關鍵字)
n_distinct(issue_keyword$category)
## [1] 40

40 categories in total.

cate_col_names <- unique(issue_keyword$category) %>% as.character()

Issues (5-01)

Number of keywords per issue category

issue_keyword %>% 
  group_by(category) %>% 
  count() %>% 
  arrange(desc(n))
## # A tibble: 40 x 2
## # Groups:   category [40]
##    category           n
##    <fct>          <int>
##  1 環境污染與保護    81
##  2 體育              75
##  3 財政稅金制度      67
##  4 科技發展          59
##  5 交通建設          51
##  6 兩岸關係          50
##  7 經濟發展          50
##  8 交通安全          40
##  9 原住民議題        40
## 10 國防安全          40
## # … with 30 more rows

Number of issue mentions per candidate in July

tokens %>% 
  left_join(issue_keyword, by="word") %>% 
  group_by(topicBy, category) %>% 
  filter(!is.na(category)) %>% 
  count()
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 229 x 3
## # Groups:   topicBy, category [229]
##    topicBy      category                           n
##    <fct>        <fct>                          <int>
##  1 丁守中粉絲團 交通安全                           2
##  2 丁守中粉絲團 交通建設                           3
##  3 丁守中粉絲團 健康醫療                           4
##  4 丁守中粉絲團 公共建設                           2
##  5 丁守中粉絲團 公共行政                           1
##  6 丁守中粉絲團 國家認同-東京奧運正名「台灣」     4
##  7 丁守中粉絲團 國際外交                           6
##  8 丁守中粉絲團 土地正義                           3
##  9 丁守中粉絲團 年金改革                           1
## 10 丁守中粉絲團 所得分配                           1
## # … with 219 more rows

Visualization of issue mentions per candidate in July

tokens %>% 
  left_join(issue_keyword, by="word") %>% 
  group_by(topicBy, category) %>% 
  filter(!is.na(category)) %>% 
  count() %>% 
  ggplot(aes(x=category, y=n)) +
  geom_bar(stat = "identity") +
  xlab(NULL) +
  theme(axis.text.x = element_text(size=6, angle=30))
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`

g <- tokens %>% 
  left_join(issue_keyword, by="word") %>% 
  group_by(topicBy, category) %>% 
  filter(!is.na(category)) %>% 
  count()
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`
ggplot(g, aes(x=topicBy, y=n)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ category)+
  xlab(NULL) +
  theme(axis.text.x = element_text(size=6, angle=30))

Pivot issue categories into columns

tokens %>% 
  left_join(issue_keyword, by="word") %>% 
  group_by(topicBy, category) %>% 
  filter(!is.na(category)) %>% 
  count() %>% 
  dcast(topicBy ~ category, value.var = "n", fill = 0) %>% 
  head()
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`
##         topicBy 交通安全 交通建設 健康醫療 兩岸關係 公共建設 公共行政
## 1  丁守中粉絲團        2        3        4        0        2        1
## 2        侯友宜        0        3        3        0        1        0
## 3  姚文智翻台北        0        0        1        2        2        0
## 4        林佳龍        3       11        4        0       10        0
## 5 林義豐MarkLin        2        4        1        0        0        0
## 6        柯文哲        5        1        6        1        0        0
##   勞工問題 原住民議題 同婚 國家認同-東京奧運正名「台灣」 國防安全
## 1        0          0    0                              4        0
## 2        0          0    0                              0        1
## 3        0          0    0                              0        0
## 4        0          1    0                              0        0
## 5        0          0    0                              0        0
## 6        0          0    0                              0        0
##   國際外交 土地正義 城鄉發展差距 年金改革 所得分配 政治貪腐 教育政策
## 1        6        3            0        1        1        2        6
## 2        0        0            0        0        0        0        5
## 3        3        0            0        1        0        0        2
## 4        9        0            0        0        2        0        9
## 5       17        0            0        0        0        0        0
## 6        1        2            0        0        1        0        6
##   文化藝術與文創 族群議題 死刑存廢 災情與災難防治 環境污染與保護 社會治安
## 1              4        2        2              2              1        7
## 2              1        0        0              3              1        2
## 3              2        0        0              2              1        0
## 4              6        1        0             13              8        0
## 5              0        0        0              0              0        0
## 6             12        2        0             11              1        0
##   社會福利 科技發展 經濟發展 能源議題 觀光發展 財政稅金制度 轉型正義
## 1        1        2        6        1        4            0        0
## 2        3        2        0        0       10            0        0
## 3        0        0        0        0        0            0        2
## 4        0       12        1        8        5            1        0
## 5        0        8        0        0        2            0        0
## 6        0        5        2        0        4            0        0
##   農漁業問題 食品安全 體育 高齡社會
## 1          5        0    9        1
## 2          3        0    6        0
## 3          2        1   10        0
## 4          0        3   65        1
## 5          0        0    5        0
## 6          0        0    9        5

Export issue counts to Excel

output_issue_count_07 <- tokens %>% 
  left_join(issue_keyword, by="word") %>% 
  group_by(url, category) %>% 
  filter(!is.na(category)) %>% 
  count() %>% 
  ungroup() %>% 
  dcast(url ~ category, value.var = "n", fill = 0) %>% 
  right_join(july, by="url")
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`
output_issue_count_07[is.na(output_issue_count_07)] <- 0
write.xlsx(output_issue_count_07, "output/output_issue_count_07.xlsx")
rm(output_issue_count_07)

4 Personality

Topic Model

DFM: document-feature matrix; DTM: document-term matrix

july_dtm <- tokens %>% 
  cast_dtm(url, word, count)
july_dtm
## <<DocumentTermMatrix (documents: 1085, terms: 13509)>>
## Non-/sparse entries: 53765/14603500
## Sparsity           : 100%
## Maximal term length: 9
## Weighting          : term frequency (tf)

Fit the model

Fix for the error "Each row of the input matrix needs to contain at least one non-zero entry": drop documents with no remaining terms before fitting.

rowTotals <- apply(july_dtm , 1, sum) #Find the sum of words in each Document
july_dtm <- july_dtm[rowTotals > 0, ] #remove all docs without words
july_lda <- LDA(july_dtm, k = 6, control = list(seed = 1234))

Word-topic probabilities

ap_topics <- tidy(july_lda, matrix = "beta")
ap_topics
## # A tibble: 81,054 x 3
##    topic term      beta
##    <int> <chr>    <dbl>
##  1     1 上    0.00161 
##  2     2 上    0.000307
##  3     3 上    0.00147 
##  4     4 上    0.00146 
##  5     5 上    0.00104 
##  6     6 上    0.000633
##  7     1 並    0.000380
##  8     2 並    0.00105 
##  9     3 並    0.000473
## 10     4 並    0.00102 
## # … with 81,044 more rows
tidy(july_lda, matrix = "beta") %>%
  filter(!term %in% stopwords) %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  theme(text = element_text(family = "黑體-繁 中黑"))

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread
## # A tibble: 279 x 8
##    term     topic1   topic2   topic3   topic4   topic5   topic6 log_ratio
##    <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
##  1 一     1.53e- 3 3.99e- 4 2.48e- 4 1.71e- 3 0.00145  1.15e- 3    -1.94 
##  2 一起   2.72e- 3 6.31e- 4 4.80e- 3 2.00e- 3 0.00308  1.05e- 3    -2.11 
##  3 丁守中 4.16e- 3 6.76e-37 1.07e- 3 8.55e- 4 0.000996 1.00e-12  -112.   
##  4 上     1.61e- 3 3.07e- 4 1.47e- 3 1.46e- 3 0.00104  6.33e- 4    -2.39 
##  5 下午   1.01e- 3 9.01e-19 1.22e-20 7.62e- 4 0.000420 4.38e- 4   -50.0  
##  6 不     3.86e- 3 2.35e- 3 5.91e- 3 3.12e- 3 0.00326  1.15e- 8    -0.716
##  7 不是   1.63e- 3 1.47e-10 9.02e- 4 3.20e- 4 0.000368 1.58e-14   -23.4  
##  8 並     3.80e- 4 1.05e- 3 4.73e- 4 1.02e- 3 0.000768 2.69e- 3     1.46 
##  9 中     7.47e- 3 2.45e- 3 2.65e- 3 3.43e- 3 0.00123  7.83e- 4    -1.61 
## 10 中壢   5.02e-52 2.72e- 3 3.64e- 4 3.91e-33 0.00137  1.04e- 7   162.   
## # … with 269 more rows
beta_spread %>% 
  top_n(10, log_ratio) %>% 
  rbind(beta_spread %>% top_n(-10, log_ratio)) %>%
  arrange(-log_ratio) %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(term, log_ratio)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme(text = element_text(family = "黑體-繁 中黑"))

Predicting the topic of each post in July

july_topics <- tidy(july_lda, matrix = "gamma") %>% 
                group_by(document) %>%
                top_n(1, wt = gamma)
head(july_topics)
## # A tibble: 6 x 3
## # Groups:   document [6]
##   document                                                      topic gamma
##   <chr>                                                         <int> <dbl>
## 1 http://www.facebook.com/109391162488374/posts/17784152555859…     1 0.998
## 2 http://www.facebook.com/109391162488374/posts/17794676688140…     1 0.996
## 3 http://www.facebook.com/109391162488374/posts/17812340386374…     1 0.996
## 4 http://www.facebook.com/109391162488374/posts/17819411885666…     1 0.999
## 5 http://www.facebook.com/109391162488374/posts/17859646314976…     1 0.998
## 6 http://www.facebook.com/109391162488374/posts/17877339813207…     1 0.525

TF-IDF

tokens_tf_idf <- tokens %>% 
  bind_tf_idf(word, topicBy, count) %>% 
  arrange(desc(tf_idf))
## Warning in bind_tf_idf.data.frame(., word, topicBy, count): A value for tf_idf is negative:
## Input should have exactly one row per document-term combination.
tokens_tf_idf %>% 
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(topicBy) %>% 
  top_n(15) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = topicBy)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~topicBy, ncol = 6, scales = "free") +
  coord_flip()
## Selecting by tf_idf

Sentiment weighting with TF-IDF

sentiment_tf_idf <- tokens_tf_idf %>% 
  inner_join(NTUSD_ch) %>% 
  mutate(sentiment=ifelse(sentiment %in% "positive", 1, -1),
         weighted_sentiment=tf_idf*sentiment)
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
sentiment_tf_idf %>% 
  group_by(url) %>% 
  summarise(swt = sum(weighted_sentiment)) %>% 
  left_join(url_person) %>% 
  ungroup() %>% 
  group_by(topicBy) %>% 
  summarise(swt=sum(swt)) %>% 
  ggplot(aes(x=topicBy, y=swt))+
  geom_col() +
  theme(axis.text.x=element_text(angle=30))
## Joining, by = "url"

When sentiment is weighted by TF-IDF, every candidate's total becomes negative. This is probably related to the bind_tf_idf warning above (duplicated document-term rows producing negative tf_idf values) and should be re-checked.

Per-post sentiment weight computed with TF-IDF

sentiment_tf_idf %>% 
  group_by(url, topicBy) %>% 
  summarise(ws=sum(weighted_sentiment)) %>% 
  arrange(desc(ws)) %>% 
  head(10)
## # A tibble: 10 x 3
## # Groups:   url [10]
##    url                                                  topicBy          ws
##    <fct>                                                <fct>         <dbl>
##  1 http://www.facebook.com/1863023523934803/posts/2207… 韓國瑜      0.00682
##  2 http://www.facebook.com/261813197541354/posts/80316… 侯友宜      0.00450
##  3 http://www.facebook.com/261813197541354/posts/79210… 侯友宜      0.00328
##  4 http://www.facebook.com/261813197541354/posts/80780… 侯友宜      0.00314
##  5 http://www.facebook.com/1863023523934803/posts/2233… 韓國瑜      0.00310
##  6 http://www.facebook.com/339483189800311/posts/52640… 高思博 A Po 0.00287
##  7 http://www.facebook.com/1912587195640543/posts/2172… 林義豐MarkLin… 0.00258
##  8 http://www.facebook.com/1863023523934803/posts/2228… 韓國瑜      0.00257
##  9 http://www.facebook.com/1380211668909443/posts/2044… 姚文智翻台北… 0.00234
## 10 http://www.facebook.com/136845026417486/posts/13844… 柯文哲      0.00223

Feature Selection

july_dtm_tf_idf <- tokens %>% 
  cast_dtm(url, word, count, weighting = tm::weightTfIdf)
july_dtm_tf_idf
## <<DocumentTermMatrix (documents: 1085, terms: 13509)>>
## Non-/sparse entries: 53765/14603500
## Sparsity           : 100%
## Maximal term length: 9
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
july_dtm_tf_idf <- removeSparseTerms(july_dtm_tf_idf, sparse=.99)

text2vec

library(text2vec)
## 
## Attaching package: 'text2vec'
## The following object is masked from 'package:topicmodels':
## 
##     perplexity
article_words <- lapply(as.character(july_sentences$sentence), 
  function(x) {
      tokens <- segment(x, jieba_tokenizer)
      tokens <- tokens[!str_detect(tokens, regex("[0-9a-zA-Z]"))]
      tokens <- tokens[!(tokens %in% stopwords)]
      return(tokens)
  })
# iterator over the tokenized documents
tokens_vec <- itoken(article_words)
# vocabulary with term counts and document counts
vocab_vec <- create_vocabulary(tokens_vec, ngram=c(1,1))
tail(vocab_vec, 20)
## Number of docs: 655 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##         term term_count doc_count
##  1:     城市        150        96
##  2:     發展        156        97
##  3:     高雄        161        53
##  4:     市民        162       110
##  5:       月        163        91
##  6:     今天        169       141
##  7:     未來        172       127
##  8:       年        175       102
##  9:     台中        178        90
## 10:   陳其邁        179        46
## 11:     一起        189       136
## 12:     文化        189        87
## 13:       新        189        95
## 14:       篇        207        10
## 15:       中        224       146
## 16:     台灣        226       123
## 17:       話        237        13
## 18: 思博新政        248        10
## 19:     市長        338       198
## 20:     桃園        402       125
# it <- itoken(tokens, ids = movie_review$id[1:4000], progressbar = FALSE)
it <- itoken(article_words)
v <- create_vocabulary(it)
v <- prune_vocabulary(v, term_count_min = 10, doc_proportion_max = 0.2)
  
vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(it, vectorizer, type = "dgTMatrix")

lda_model <- LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(x = dtm, n_iter = 1000, 
                          convergence_tol = 0.001, n_check_convergence = 25, 
                          progressbar = FALSE)
## INFO [2020-01-20 09:18:51] iter 25 loglikelihood = -165219.461
## INFO [2020-01-20 09:18:51] iter 50 loglikelihood = -161820.825
## INFO [2020-01-20 09:18:51] iter 75 loglikelihood = -160333.801
## INFO [2020-01-20 09:18:51] iter 100 loglikelihood = -159143.744
## INFO [2020-01-20 09:18:51] iter 125 loglikelihood = -158659.131
## INFO [2020-01-20 09:18:51] iter 150 loglikelihood = -158358.275
## INFO [2020-01-20 09:18:51] iter 175 loglikelihood = -158234.676
## INFO [2020-01-20 09:18:51] early stopping at 175 iteration

LDA Topic proportion

barplot(doc_topic_distr[1, ], xlab = "topic", 
        ylab = "proportion", ylim = c(0, 1), 
        names.arg = 1:ncol(doc_topic_distr))

Top 10 words for selected topics (lambda = 1, then lambda = 0.2)

lda_model$get_top_words(n = 10, topic_number = c(1L, 5L, 10L), lambda = 1)
##       [,1]     [,2]     [,3]    
##  [1,] "陳其邁" "服務"   "台北"  
##  [2,] "高雄"   "空間"   "丁守中"
##  [3,] "台南"   "捷運"   "柯文哲"
##  [4,] "議員"   "更新"   "經濟"  
##  [5,] "高雄市" "提供"   "新"    
##  [6,] "參選人" "蘇貞昌" "支持"  
##  [7,] "感謝"   "團隊"   "未來"  
##  [8,] "加入"   "推動"   "拚"    
##  [9,] "陳"     "市府"   "計畫"  
## [10,] "黃"     "現在"   "政策"
lda_model$get_top_words(n = 10, topic_number = c(1L, 5L, 10L), lambda = 0.2)
##       [,1]     [,2]       [,3]    
##  [1,] "陳其邁" "蘇貞昌"   "丁守中"
##  [2,] "高雄"   "里長"     "柯文哲"
##  [3,] "議員"   "更新"     "拚"    
##  [4,] "高雄市" "興建"     "台北"  
##  [5,] "黃"     "金山"     "住宅"  
##  [6,] "立即"   "機捷"     "驕傲"  
##  [7,] "陳"     "衝衝衝"   "產值"  
##  [8,] "表示"   "花博"     "心"    
##  [9,] "參選人" "營運"     "高中"  
## [10,] "立委"   "活動中心" "建立"

Dynamic Visualization

lda_model$plot()
## Loading required namespace: servr

GloVe word vectors

vectorizer_vec <- vocab_vectorizer(vocab_vec)
# tcm = term co-occurrence matrix
tcm_vec <- create_tcm(tokens_vec, vectorizer_vec)  

Fit GloVe on the term co-occurrence matrix (TCM)

glove_model <- GlobalVectors$new(word_vectors_size = 50, vocabulary = create_vocabulary(tokens_vec), x_max = 10)
glove <- glove_model$fit_transform(tcm_vec, n_iter = 20)
## INFO [2020-01-20 09:18:53] 2020-01-20 09:18:53 - epoch 1, expected cost 0.0652
## INFO [2020-01-20 09:18:54] 2020-01-20 09:18:54 - epoch 2, expected cost 0.0379
## INFO [2020-01-20 09:18:54] 2020-01-20 09:18:54 - epoch 3, expected cost 0.0294
## INFO [2020-01-20 09:18:55] 2020-01-20 09:18:55 - epoch 4, expected cost 0.0243
## INFO [2020-01-20 09:18:55] 2020-01-20 09:18:55 - epoch 5, expected cost 0.0208
## INFO [2020-01-20 09:18:55] 2020-01-20 09:18:55 - epoch 6, expected cost 0.0181
## INFO [2020-01-20 09:18:56] 2020-01-20 09:18:56 - epoch 7, expected cost 0.0160
## INFO [2020-01-20 09:18:56] 2020-01-20 09:18:56 - epoch 8, expected cost 0.0144
## INFO [2020-01-20 09:18:57] 2020-01-20 09:18:57 - epoch 9, expected cost 0.0130
## INFO [2020-01-20 09:18:57] 2020-01-20 09:18:57 - epoch 10, expected cost 0.0119
## INFO [2020-01-20 09:18:57] 2020-01-20 09:18:57 - epoch 11, expected cost 0.0109
## INFO [2020-01-20 09:18:58] 2020-01-20 09:18:58 - epoch 12, expected cost 0.0101
## INFO [2020-01-20 09:18:58] 2020-01-20 09:18:58 - epoch 13, expected cost 0.0094
## INFO [2020-01-20 09:18:59] 2020-01-20 09:18:59 - epoch 14, expected cost 0.0087
## INFO [2020-01-20 09:18:59] 2020-01-20 09:18:59 - epoch 15, expected cost 0.0082
## INFO [2020-01-20 09:19:00] 2020-01-20 09:18:59 - epoch 16, expected cost 0.0077
## INFO [2020-01-20 09:19:00] 2020-01-20 09:19:00 - epoch 17, expected cost 0.0072
## INFO [2020-01-20 09:19:00] 2020-01-20 09:19:00 - epoch 18, expected cost 0.0069
## INFO [2020-01-20 09:19:01] 2020-01-20 09:19:01 - epoch 19, expected cost 0.0065
## INFO [2020-01-20 09:19:01] 2020-01-20 09:19:01 - epoch 20, expected cost 0.0062
# PCA to inspect word positions
pca = prcomp(glove, scale. = T, center = T)
# rownames(pca$x) = rownames(word.vec)
i = grep('市長|', rownames(pca$x))
j = grep('台北|架子|勤奮',rownames(pca$x)[i])
plot(pca$x[i[-j] ,2], pca$x[i[-j] ,3], type = 'b', lty = 2, pch = 'x', col = 'red', xlim = c(0.8,-2), ylim = c(4,-1))
text(pca$x[i[-j] ,2], pca$x[i[-j] ,3], labels = rownames(pca$x)[i[-j]], pos = 4)

# i = grep('姚文智|丁能兒', rownames(pca$x))
# j = grep('市長|柯文哲|丁守中|姚文智',rownames(pca$x)[i])
# points(pca$x[i[-j] ,2], pca$x[i[-j] ,3], type = 'b', lty = 2, pch = 'x', col = 'red')
# text(pca$x[i[-j] ,2], pca$x[i[-j] ,3], labels = rownames(pca$x)[i[-j]], pos = 4)
k <- kmeans(glove, 300, iter.max = 100)
unique(k$cluster)
##   [1] 242 178 163 224  57 107  56 273   9 284   1 283  20 276  52 205 112
##  [18]  48 233 124 279  30  12  32 169  28  75 139 260 170 127 294 116  16
##  [35] 209 159 286 146  73 221 143  34 114 263  19  42  14  10  97 131 231
##  [52] 189  21  85 281  44 206 121  63  54 200  37  89  74 228 156 123 289
##  [69] 187 270 196 148   5  25  59 236 295  49 157  96 202  18 293  93 110
##  [86]   4 190 181  82 174 100 158  70 203  39  91 299  80 185 204 193 128
## [103] 162  43 234 198  66 101 117 183 250 151 251 144   8 232   6 291  72
## [120] 226  77 225 256 300 130  90 168 223  13 239 133 227 113 220 199 212
## [137] 188  45 230 208 177 173  87 255 207  50 109 160  84 216  23 103 182
## [154] 118 105 180  24  94 115 248 152 218 240  38 217  92 269  53 186 102
## [171] 257  88 136 237 134 252 147 265  81 126  55 261 229 172 132 104  60
## [188] 122 266 215 141  62 176  15 241  76 282  33  35 288 129 175  71 238
## [205]  64 272  27   7 297 296 155  69 264 287  79 262 137 267  86 145 197
## [222] 271 140  65  61 119 153 138 278  26 235 211 161  31  51 275   2 222
## [239] 149 166 201 245  58  95 285  36 135 244  67  40 171  17 194  83 195
## [256] 165 111 254 213 142 120 179  98 280 247  22  29 210 290 191 274 249
## [273]  47 154 268 219 243 259  68 150 106 167  46 125 108  99 192 214 184
## [290] 298 292  78 164  41 258 277  11 253   3 246

Word similarity
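This section was left as a stub. A minimal sketch of how the GloVe vectors fitted above could be queried for similar words, based on standard text2vec usage rather than code from the original analysis (「市長」 is just an example query term):

# combine the main and context word vectors, as is common for GloVe
word_vectors <- glove + t(glove_model$components)
# cosine similarity between 「市長」 and every word in the vocabulary
word_sim <- sim2(x = word_vectors,
                 y = word_vectors["市長", , drop = FALSE],
                 method = "cosine", norm = "l2")
# ten most similar words
head(sort(word_sim[, 1], decreasing = TRUE), 10)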

Structural topic models (STM): skipped for now

https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/

library(stm)
## stm v1.3.5 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com
library(stminsights)
library(quanteda)
## Package version: 1.5.2
## Parallel computing: 2 of 24 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
## 
##     View
# dfm2stm <- convert(july_dtm_tf_idf, to = "stm")
# model.stm <- stm(
#   dfm2stm$documents,
#   dfm2stm$vocab,
#   K = topic.count,
#   data = dfm2stm$meta,
#   init.type = "Spectral"
# )

Notes

https://www.facebook.com/339483189800311/posts/532367277178567 In this post 高思博's name is counted 37 times. 高思博 often appends many hashtags to his posts, or organizes them into series such as 思博新政:

【思博新政-第一話】治安篇 https://tinyurl.com/apokao1
【思博新政-第二話】水資源篇 https://tinyurl.com/apokaowater

These multi-episode titles make 「思博」 and 「新政」 appear very frequently. > 「思博新政」 has therefore been added to the custom dictionary.

Changes in each candidate's issues from July to November
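A possible way to extend the July pipeline across months (a sketch only: it assumes monthly files data/08.xlsx through data/11.xlsx exist with the same column layout as data/07.xlsx, and that the same cleaning and tokenization steps are re-run on the combined table):

# read every monthly export and keep the politicians' own posts
months <- c("07", "08", "09", "10", "11")
all_posts <- lapply(months, function(m) {
  d <- read.xlsx(paste0("data/", m, ".xlsx"), sheetIndex = 1, encoding = "UTF-8")
  colnames(d)[1:5] <- c("note", "dpm", "day", "mid", "id")
  d %>% filter(is.na(note)) %>% select(url, content, topicBy) %>% mutate(month = m)
}) %>% bind_rows()

# after re-running the cleaning/tokenization steps above on all_posts (giving a
# token table tokens_all that keeps the month column), issue mentions per
# candidate, month, and category would be:
# tokens_all %>%
#   left_join(issue_keyword, by = "word") %>%
#   filter(!is.na(category)) %>%
#   count(month, topicBy, category)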

According to the MIS master's thesis 《應用情感分析於輿情之研究-以台灣2016總統選舉為例》 (2014), the NTUSD dictionary largely lacks terms from the political domain and from netizen comments, e.g. 賣台, 挺柱, 下台, so the seed word set needs to be expanded.
Chinese Wordnet and SentiWordNet should be added, and sentiment scores should be normalized (Z-scores). https://liferay.de.dariah.eu/tatom/feature_selection.html
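Two small sketches of these follow-ups, using objects defined earlier (NTUSD_ch and sentiment_weight). The polarities assigned to the added example terms are assumptions for illustration only:

# expand the seed lexicon with the political terms mentioned above
# (polarities here are assumed, not taken from any published dictionary)
extra_terms <- data.frame(word = c("賣台", "下台", "挺柱"),
                          sentiment = c("negative", "negative", "positive"))
NTUSD_ch <- rbind(NTUSD_ch, extra_terms)

# Z-score normalization of the per-post sentiment sums
sentiment_weight <- sentiment_weight %>%
  mutate(swt_z = (swt - mean(swt, na.rm = TRUE)) / sd(swt, na.rm = TRUE))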