I. Tasks and required skills
1. Word segmentation (segment analysis):
Split the text of each Facebook post into "words" (the smallest meaningful language units that can be used on their own).
Required skills:
- Familiarity with segmentation models, including (but not limited to): the maximum-probability model (MPSegment), the hidden Markov model (HMMSegment), the mixed model (MixSegment), and the query/index model (QuerySegment)
- Familiarity with text-mining packages, including (but not limited to): tm, tidy, jiebaR
2. Keyword frequency counts: using the keyword list provided by this study, count how many times each keyword appears in each post.
→ Example:
Given the keyword 「我們」, the returned file should add a column recording how many times 「我們」 appears in each post.
For instance, 「我們」 appears 0, 0, and 2 times in the first, second, and third posts respectively (a minimal sketch follows).
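As a rough sketch of this counting step (the toy posts table and its column names are made up for illustration; the actual pipeline below counts after tokenization), stringr::str_count() can count keyword occurrences directly:

library(dplyr)
library(stringr)

# toy posts table; column names are illustrative only
posts <- tibble(
  url     = c("post1", "post2", "post3"),
  content = c("今天天氣真好", "謝謝大家的支持", "我們一起努力,我們加油")
)

# count how many times the keyword 「我們」 appears in each post
posts %>% mutate(count_keyword = str_count(content, fixed("我們")))
# count_keyword: 0, 0, 2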
3. Keyword presence: using the keyword list provided by this study, mark whether each keyword appears in each post. → Example: given the keyword 「我們」, the returned file should add a column marking whether 「我們」 appears in each post.
✓ For instance, the first and second posts do not contain 「我們」 (marked 0), while the third post does (marked 1); see the short sketch below.
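Continuing the toy posts table from the sketch above, the presence flag is simply a thresholded count, e.g. with stringr::str_detect():

# mark whether the keyword 「我們」 appears at all (1) or not (0)
posts %>% mutate(has_keyword = as.integer(str_detect(content, fixed("我們"))))
# has_keyword: 0, 0, 1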
4. Sentiment analysis: using the sentiment-word dictionary provided by this study, weight the positive and negative sentiment words appearing in each post and compute a sentiment score per post.
✓ Must be able to remove stopwords and tokenize
✓ Must be able to apply the vocabulary of a specified dictionary
✓ Must be able to weight dictionary terms by positive/negative polarity (a minimal weighting sketch follows)
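A minimal sketch of the weighting idea, using a made-up toy dictionary and token table (the actual implementation with the NTUSD dictionary appears later in this document):

library(dplyr)

# toy sentiment dictionary and per-post tokens (illustrative only)
sentiment_dict <- tibble(
  word      = c("支持", "努力", "失望"),
  sentiment = c("positive", "positive", "negative")
)
post_tokens <- tibble(
  url  = c("post1", "post1", "post2"),
  word = c("支持", "失望", "努力")
)

# +1 for positive words, -1 for negative words, summed per post
post_tokens %>%
  inner_join(sentiment_dict, by = "word") %>%
  mutate(weight = ifelse(sentiment == "positive", 1, -1)) %>%
  group_by(url) %>%
  summarise(sentiment_score = sum(weight))
# post1: 0 (+1 - 1), post2: +1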
5. Sum keyword counts per category for each candidate: for each candidate, sum the counts of all keywords belonging to each category.
→ Example: the category 「1-01提及自己名字」 ("mentions own name") contains the three keywords 「柯文哲」, 「文哲」, 「阿北」. For each of the 14 candidates (柯文哲, 姚文智, etc.), the returned file should add columns for "total keyword occurrences" and "number of distinct keywords that appear". For instance, if all posts on 柯文哲's fan page contain 「柯文哲」 0 times, 「柯P」 4 times, and 「阿北」 once, then "total keyword occurrences" is 5 (= 0 + 4 + 1) and "number of distinct keywords" is 2 (two keyword types appear). A sketch of this aggregation follows.
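A minimal sketch of this aggregation, using the toy numbers from the example above (the table is illustrative; the real category tables are built later from the keyword Excel file):

library(dplyr)

# keywords of category "1-01 mentions own name" with toy counts summed
# over all posts of one fan page (0 + 4 + 1 from the example above)
category_counts <- tibble(
  word  = c("柯文哲", "柯P", "阿北"),
  count = c(0, 4, 1)
)

category_counts %>%
  summarise(total_count       = sum(count),      # 5 = 0 + 4 + 1
            distinct_keywords = sum(count > 0))  # 2 keyword types appear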
library(xlsx)
library(readr)
library(dplyr)
library(jiebaR)
library(tidyr)
library(tidytext)
library(topicmodels)
library(stringr)
library(ggplot2)
library(wordcloud2)
library(data.table)
library(reshape2)
library(scales)
library(tm)
library(tmcn)
library(lubridate)
# July data
july_data <- read.xlsx("data/07.xlsx", sheetIndex = 1, encoding = "UTF-8")
# August data
# aug_data <- read.xlsx("data/08.xlsx", sheetIndex = 1, encoding = "UTF-8")
Keep only posts written by the politicians themselves, and select the columns:
- url - content - topicBy
colnames(july_data)[1:5] <- c("note", "dpm", "day", "mid", "id")
july <- july_data[is.na(july_data$note),] %>%
select(url, content, topicBy)
First problem encountered:
Emoji show up as garbled characters and are not part of the analysis, so they have to be filtered out.
Reference: the Unicode ranges of all emoji.
After removing emoji, some content still contains encoding errors that were already present in the source file, e.g.:
july_sentences <- july %>%
# remove emoji
mutate(sentence = gsub("(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])", "", content)) %>%
# remove non UTF-8
mutate(sentence = gsub("[^[:alnum:][:blank:]?&/\\-]", "", sentence))
# july_sentences <- strsplit(july_sentences$sentence, "[。!;?!?;]")
#
# july_sentences <- data.frame(
# url = rep(july$url, sapply(july_sentences, length)),
# topicBy = rep(july$topicBy, sapply(july_sentences, length)),
# sentence = unlist(july_sentences)
# )
Filter out rows that contain only tabs, newlines, or spaces, and convert sentence to character format.
july_sentences <- july_sentences %>%
filter(!str_detect(sentence, regex("^(\t|\n| )*$")))
july_sentences$sentence <- as.character(july_sentences$sentence)
Load the first keyword list: words candidates use to refer to themselves.
their_names <- scan(file = "dict/their_names.txt",
what = character(),
sep = "\n",
encoding = "utf-8",
fileEncoding = "utf-8")
head(their_names)
## [1] "我" "我們" "自己" "小弟" "小妹" "偶"
This keyword dictionary was pre-processed with Python; every candidate's nicknames are used as keywords, 323 terms in total.
candidate_name <- scan(file = "dict/candidate_name_new.txt",
what = character(),
sep = "\n",
encoding = "utf-8",
fileEncoding = "utf-8")
head(candidate_name)
## [1] "丁爺爺" "丁能兒" "叮手中" "輸不起" "subuki" "老狐狸"
Add 「思博新政」 via a customized dictionary.
customized_dict <- scan(file = "dict/customized.txt",
what = character(),
sep = "\n",
encoding = "utf-8",
fileEncoding = "utf-8")
Tokenize with jieba first, adding the dictionaries above as user words.
jieba_tokenizer <- worker()
new_user_word(jieba_tokenizer, c(customized_dict))
## [1] TRUE
Chinese tokenizer function:
chi_tokenizer <- function(t) {
lapply(t, function(x) {
#if(nchar(x) > 1){
tokens <- segment(x, jieba_tokenizer)
# tokens <- tokens[nchar(tokens) > 1]
return(tokens)
#}
})
}
Tokenize each sentence, drop tokens containing English letters or digits, and count term frequencies per post.
tokens <- july_sentences %>%
unnest_tokens(word, sentence, token = chi_tokenizer) %>%
filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>%
count(url, word) %>%
rename(count = n)
Convert the loaded keyword lists into data frames and merge them into a single dictionary.
their_names_df <- data.frame(word=their_names)
candidate_name_df <- data.frame(word=candidate_name)
all_keys_df <- rbind(their_names_df, candidate_name_df)
Keep only the keywords this study needs, and spread the values into columns.
dcast_tokens <- tokens %>%
inner_join(all_keys_df) %>%
dcast(url ~ word, value.var = "count", fill = 0)
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Aggregation function missing: defaulting to length
non_cols <- all_keys_df %>%
anti_join(tokens) %>%
dcast(. ~ word) %>%
select(-1)
## Joining, by = "word"
## Warning: Column `word` joining factor and character vector, coercing into
## character vector
## Using word as value column: use value.var to override.
## Aggregation function missing: defaulting to length
# set all values to 0
non_cols[1:nrow(non_cols),1:ncol(non_cols)] <- 0
# bind the zero-filled columns back onto the keyword-count table
dcast_tokens <- cbind(dcast_tokens, non_cols)
output_count <- july %>%
left_join(dcast_tokens, by=c("url"))
output_count[is.na(output_count)] <- 0
75 posts contain none of the keywords, so the NAs produced by the join are filled with 0.
output_exist <- tokens %>%
mutate(exists = ifelse(word %in% all_keys_df$word, 1, 0)) %>%
filter(exists == 1) %>%
dcast(url ~ word, value.var = "exists", fill = 0)
temp <- july %>%
left_join(output_exist) %>%
replace(is.na(.), 0)
## Joining, by = "url"
In total 74 posts contain none of these keywords, so the left join produces NAs, which are filled with 0.
# write.csv(output_count, file="output/output_count_07.csv", fileEncoding = "UTF-8")
write.csv(output_exist, file="output/output_exist_07.csv", fileEncoding = "UTF-8")
url_person <- july %>% select(url, topicBy)
tokens <- tokens %>% left_join(url_person) %>% filter(!is.na(word))
## Joining, by = "url"
tokens_count <- tokens %>%
# filter(nchar(.$word) > 1) %>%
group_by(word, topicBy) %>%
summarise(sum = n()) %>%
filter(sum > 10) %>%
arrange(topicBy, desc(sum))
tokens_count %>%
filter(word=="我") %>%
ggplot(aes(word, sum, fill=topicBy)) +
geom_col() +
facet_grid(topicBy~.) +
xlab(NULL) +
coord_flip()
tokens_count %>%
filter(word=="我們") %>%
ggplot(aes(word, sum, fill=topicBy)) +
geom_col() +
facet_grid(topicBy~.) +
xlab(NULL) +
coord_flip()
import pandas as pd
import numpy as np

keywords = pd.read_excel("/Users/yuzhe/Downloads/資工背景工讀生_關鍵字整理列表(初步建構中)__20191111.xlsx",
                         sheet_name="1-02、1-04、1-06 主要政治人物")
keywords = keywords[['子類目', '關鍵字', '(此部分為綽號須注意) 負面關鍵字', '(此部分為綽號須注意) 中性 or 正面關鍵字']]
keywords.columns = ['name', 'first_name', 'negative_nick', 'positive_nick']

# concatenate the real name and the negative/positive nicknames into one string per candidate
keywords['keyword'] = (keywords['first_name'].astype(str) + ' or ' +
                       keywords['negative_nick'].astype(str) + ' or ' +
                       keywords['positive_nick'].astype(str))

# one row per (candidate, nickname) pair (reconstructed step: name_nick was left undefined in the original)
name_nick = keywords[['name', 'keyword']].copy()
name_nick['keyword'] = name_nick['keyword'].str.split(' or ')
name_nick = name_nick.explode('keyword')

# empty cells became the string "nan"; drop them and write out the mapping
name_nick.replace('nan', np.NaN, inplace=True)
name_nick.dropna(how='any', inplace=True)
name_nick.to_csv('cadidate_nickname.csv', index=False)
Load the candidate-nickname mapping produced by the Python processing above.
candidate_nickname <- read_csv("dict/cadidate_nickname.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## keyword = col_character()
## )
colnames(candidate_nickname) <- c("name", "word")
After replacing each candidate's nicknames with their name:
true_name <- tokens %>%
inner_join(candidate_nickname) %>%
mutate(word = name, name=NULL) %>%
group_by(url, word) %>%
summarise(count=sum(count)) %>%
dcast(url ~ word, value.var = "count", fill = 0)
## Joining, by = "word"
# temp <- tokens %>%
# mutate(word = ifelse(word %in% c("守中", "丁爺爺", "丁能兒", "叮手中", "輸不起", "subuki"), "丁守中", word)) %>%
# mutate(word = ifelse(word %in% c("通靈王", "國民阿舅", "上海女婿", "五億探長", "地獄刑事", "文大包租公"), "侯友宜", word))
#
Using the sentiment-word dictionary provided by this study, weight the positive and negative sentiment words appearing in each post and compute a sentiment score per post.
✓ Must be able to remove stopwords and tokenize
✓ Must be able to apply the vocabulary of a specified dictionary
✓ Must be able to weight dictionary terms by positive/negative polarity
Because many of the stopwords in the tmcn package are simplified Chinese, we use the Traditional Chinese stopword list provided by tomlinNTUB instead, with punctuation removed so that only Chinese terms remain.
# NTU sentiment dictionary (NTUSD)
data("NTUSD")
# stopwords
stopwords <- scan(file = "dict/stopwords_tc.txt",
what = character(),
sep = "\n",
encoding = "utf-8",
fileEncoding = "utf-8")
stopwords_df <- data.frame(word=stopwords)
head(stopwords_df, 10)
## word
## 1 啊
## 2 阿
## 3 哎
## 4 哎呀
## 5 哎喲
## 6 唉
## 7 我
## 8 我們
## 9 按
## 10 按照
Reformat the NTUSD sentiment dictionary.
positive <- data.frame(word = NTUSD$positive_cht, sentiment="positive")
negative <- data.frame(word = NTUSD$negative_cht, sentiment="negative")
NTUSD_ch <- rbind(positive, negative)
head(NTUSD_ch)
## word sentiment
## 1 一帆風順 positive
## 2 一帆風順的 positive
## 3 一流 positive
## 4 一致 positive
## 5 一致的 positive
## 6 了不起 positive
LIWC dictionary: we are still looking for a larger LIWC word list; this one was provided by the course TA and contains relatively few terms.
We originally planned to compare it with NTUSD, but NTUSD has more than 10,000 terms while this list has only about 1,000, so the comparison is skipped for now.
p <- read_file("dict/liwc/positive.txt")
n <- read_file("dict/liwc/negative.txt")
positive <- strsplit(p, ",")[[1]]
negative <- strsplit(n, ",")[[1]]
positive <- data.frame(word = positive, sentiment = "positive", stringsAsFactors = FALSE)
negative <- data.frame(word = negative, sentiment = "negative", stringsAsFactors = FALSE)
LIWC_ch <- rbind(positive, negative)
word_count <- tokens %>%
# remove stopwords
anti_join(stopwords_df) %>%
select(word, count) %>%
group_by(word) %>%
summarise(count = sum(count)) %>%
filter(count>3)
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
# try a word cloud
wordcloud2(word_count)
word_count %>%
inner_join(NTUSD_ch) %>%
group_by(sentiment) %>%
top_n(15, wt = count) %>%
ungroup() %>%
mutate(word = reorder(word, count)) %>%
ggplot(aes(word, count, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
theme(text=element_text(size=14))+
coord_flip()
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
Give positive words in the NTUSD dictionary a weight of +1 and negative words a weight of -1, then sum per post.
sentiment_weight <- tokens %>%
inner_join(NTUSD_ch) %>%
mutate(sentiment=ifelse(sentiment %in% "positive", 1, -1)) %>%
group_by(url) %>%
summarise(swt = sum(sentiment))
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
head(sentiment_weight)
## # A tibble: 6 x 2
## url swt
## <fct> <dbl>
## 1 http://www.facebook.com/109391162488374/posts/1778415255585948 3
## 2 http://www.facebook.com/109391162488374/posts/1779467668814040 -1
## 3 http://www.facebook.com/109391162488374/posts/1781234038637403 3
## 4 http://www.facebook.com/109391162488374/posts/1781941188566688 5
## 5 http://www.facebook.com/109391162488374/posts/1785964631497677 -3
## 6 http://www.facebook.com/109391162488374/posts/1787733981320742 3
Compute each candidate's average per-post sentiment score.
avg_candidate_sentiment <- july %>%
left_join(sentiment_weight) %>%
group_by(topicBy) %>%
summarise(mean_wt = mean(swt, na.rm = T))
## Joining, by = "url"
head(avg_candidate_sentiment)
## # A tibble: 6 x 2
## topicBy mean_wt
## <fct> <dbl>
## 1 丁守中粉絲團 -1.02
## 2 侯友宜 3.46
## 3 姚文智翻台北 1.48
## 4 林佳龍 5.34
## 5 林義豐MarkLin 1.58
## 6 柯文哲 3.48
Visualization:
avg_candidate_sentiment %>%
arrange(desc(mean_wt)) %>%
ggplot(aes(x=topicBy, y=mean_wt, fill=topicBy)) +
geom_col(show.legend = F) +
labs(x=NULL) +
theme(axis.text.x = element_text(size=10, angle=30))
senti_changes_date <- july_data %>%
select(url, date, topicBy) %>%
mutate(date = format(date, "%Y-%m-%d")) %>%
inner_join(tokens) %>%
inner_join(NTUSD_ch) %>%
group_by(topicBy, date, sentiment) %>%
summarise(count=sum(count))
## Joining, by = c("url", "topicBy")Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
all_dates <- expand.grid(seq(as.Date(min(july_data$date)), as.Date(max(july_data$date)), by="day"), c("positive", "negative"))
colnames(all_dates) <- c("date", "sentiment")
senti_changes_date <- all_dates %>%
merge(senti_changes_date, by.x=c('date', "sentiment"), by.y=c('date', "sentiment"),
all.x=T, all.y=T) %>%
mutate(count = replace_na(count, 0))
senti_changes_date %>%
ggplot()+
geom_line(aes(x=date, y=count, colour=topicBy)) +
facet_wrap(~ sentiment, nrow=2, scales="free_y") +
scale_x_date(labels = date_format("%m/%d"))
## 5-1 Issue vocabulary
issue_keyword <- read.xlsx("dict/keyword_lists_20191210.xlsx", sheetIndex = 7, encoding = "UTF-8")
issue_keyword <- issue_keyword %>%
select(子類目, 關鍵字) %>%
rename(category=子類目, word=關鍵字)
n_distinct(issue_keyword$category)
## [1] 40
There are 40 categories in total.
cate_col_names <- unique(issue_keyword$category) %>% as.character()
issue_keyword %>%
group_by(category) %>%
count() %>%
arrange(desc(n))
## # A tibble: 40 x 2
## # Groups: category [40]
## category n
## <fct> <int>
## 1 環境污染與保護 81
## 2 體育 75
## 3 財政稅金制度 67
## 4 科技發展 59
## 5 交通建設 51
## 6 兩岸關係 50
## 7 經濟發展 50
## 8 交通安全 40
## 9 原住民議題 40
## 10 國防安全 40
## # … with 30 more rows
tokens %>%
left_join(issue_keyword, by="word") %>%
group_by(topicBy, category) %>%
filter(!is.na(category)) %>%
count()
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 229 x 3
## # Groups: topicBy, category [229]
## topicBy category n
## <fct> <fct> <int>
## 1 丁守中粉絲團 交通安全 2
## 2 丁守中粉絲團 交通建設 3
## 3 丁守中粉絲團 健康醫療 4
## 4 丁守中粉絲團 公共建設 2
## 5 丁守中粉絲團 公共行政 1
## 6 丁守中粉絲團 國家認同-東京奧運正名「台灣」 4
## 7 丁守中粉絲團 國際外交 6
## 8 丁守中粉絲團 土地正義 3
## 9 丁守中粉絲團 年金改革 1
## 10 丁守中粉絲團 所得分配 1
## # … with 219 more rows
tokens %>%
left_join(issue_keyword, by="word") %>%
group_by(topicBy, category) %>%
filter(!is.na(category)) %>%
count() %>%
ggplot(aes(x=category, y=n)) +
geom_bar(stat = "identity") +
xlab(NULL) +
theme(axis.text.x = element_text(size=6, angle=30))
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`
g <- tokens %>%
left_join(issue_keyword, by="word") %>%
group_by(topicBy, category) %>%
filter(!is.na(category)) %>%
count()
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`
ggplot(g, aes(x=topicBy, y=n)) +
geom_bar(stat = "identity") +
facet_wrap(~ category)+
xlab(NULL) +
theme(axis.text.x = element_text(size=6, angle=30))
tokens %>%
left_join(issue_keyword, by="word") %>%
group_by(topicBy, category) %>%
filter(!is.na(category)) %>%
count() %>%
dcast(topicBy ~ category, value.var = "n", fill = 0) %>%
head()
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## topicBy 交通安全 交通建設 健康醫療 兩岸關係 公共建設 公共行政
## 1 丁守中粉絲團 2 3 4 0 2 1
## 2 侯友宜 0 3 3 0 1 0
## 3 姚文智翻台北 0 0 1 2 2 0
## 4 林佳龍 3 11 4 0 10 0
## 5 林義豐MarkLin 2 4 1 0 0 0
## 6 柯文哲 5 1 6 1 0 0
## 勞工問題 原住民議題 同婚 國家認同-東京奧運正名「台灣」 國防安全
## 1 0 0 0 4 0
## 2 0 0 0 0 1
## 3 0 0 0 0 0
## 4 0 1 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 國際外交 土地正義 城鄉發展差距 年金改革 所得分配 政治貪腐 教育政策
## 1 6 3 0 1 1 2 6
## 2 0 0 0 0 0 0 5
## 3 3 0 0 1 0 0 2
## 4 9 0 0 0 2 0 9
## 5 17 0 0 0 0 0 0
## 6 1 2 0 0 1 0 6
## 文化藝術與文創 族群議題 死刑存廢 災情與災難防治 環境污染與保護 社會治安
## 1 4 2 2 2 1 7
## 2 1 0 0 3 1 2
## 3 2 0 0 2 1 0
## 4 6 1 0 13 8 0
## 5 0 0 0 0 0 0
## 6 12 2 0 11 1 0
## 社會福利 科技發展 經濟發展 能源議題 觀光發展 財政稅金制度 轉型正義
## 1 1 2 6 1 4 0 0
## 2 3 2 0 0 10 0 0
## 3 0 0 0 0 0 0 2
## 4 0 12 1 8 5 1 0
## 5 0 8 0 0 2 0 0
## 6 0 5 2 0 4 0 0
## 農漁業問題 食品安全 體育 高齡社會
## 1 5 0 9 1
## 2 3 0 6 0
## 3 2 1 10 0
## 4 0 3 65 1
## 5 0 0 5 0
## 6 0 0 9 5
output_issue_count_07 <- tokens %>%
left_join(issue_keyword, by="word") %>%
group_by(url, category) %>%
filter(!is.na(category)) %>%
count() %>%
ungroup() %>%
dcast(url ~ category, value.var = "n", fill = 0) %>%
right_join(july, by="url")
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Factor `category` contains implicit NA, consider using
## `forcats::fct_explicit_na`
output_issue_count_07[is.na(output_issue_count_07)] <- 0
write.xlsx(output_issue_count_07, "output/output_issue_count_07.xlsx")
rm(output_issue_count_07)
DFM: Document-Feature Matrix; DTM: Document-Term Matrix
july_dtm <- tokens %>%
cast_dtm(url, word, count)
july_dtm
## <<DocumentTermMatrix (documents: 1085, terms: 13509)>>
## Non-/sparse entries: 53765/14603500
## Sparsity : 100%
## Maximal term length: 9
## Weighting : term frequency (tf)
Fix for the LDA error "Each row of the input matrix needs to contain at least one non-zero entry": remove documents whose rows sum to zero.
rowTotals <- apply(july_dtm , 1, sum) #Find the sum of words in each Document
july_dtm <- july_dtm[rowTotals > 0, ] #remove all docs without words
july_lda <- LDA(july_dtm, k = 6, control = list(seed = 1234))
ap_topics <- tidy(july_lda, matrix = "beta")
ap_topics
## # A tibble: 81,054 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 上 0.00161
## 2 2 上 0.000307
## 3 3 上 0.00147
## 4 4 上 0.00146
## 5 5 上 0.00104
## 6 6 上 0.000633
## 7 1 並 0.000380
## 8 2 並 0.00105
## 9 3 並 0.000473
## 10 4 並 0.00102
## # … with 81,044 more rows
tidy(july_lda, matrix = "beta") %>%
filter(!term %in% stopwords) %>%
group_by(topic) %>%
top_n(20, beta) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
theme(text = element_text(family = "黑體-繁 中黑"))
beta_spread <- ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1))
beta_spread
## # A tibble: 279 x 8
## term topic1 topic2 topic3 topic4 topic5 topic6 log_ratio
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 一 1.53e- 3 3.99e- 4 2.48e- 4 1.71e- 3 0.00145 1.15e- 3 -1.94
## 2 一起 2.72e- 3 6.31e- 4 4.80e- 3 2.00e- 3 0.00308 1.05e- 3 -2.11
## 3 丁守中 4.16e- 3 6.76e-37 1.07e- 3 8.55e- 4 0.000996 1.00e-12 -112.
## 4 上 1.61e- 3 3.07e- 4 1.47e- 3 1.46e- 3 0.00104 6.33e- 4 -2.39
## 5 下午 1.01e- 3 9.01e-19 1.22e-20 7.62e- 4 0.000420 4.38e- 4 -50.0
## 6 不 3.86e- 3 2.35e- 3 5.91e- 3 3.12e- 3 0.00326 1.15e- 8 -0.716
## 7 不是 1.63e- 3 1.47e-10 9.02e- 4 3.20e- 4 0.000368 1.58e-14 -23.4
## 8 並 3.80e- 4 1.05e- 3 4.73e- 4 1.02e- 3 0.000768 2.69e- 3 1.46
## 9 中 7.47e- 3 2.45e- 3 2.65e- 3 3.43e- 3 0.00123 7.83e- 4 -1.61
## 10 中壢 5.02e-52 2.72e- 3 3.64e- 4 3.91e-33 0.00137 1.04e- 7 162.
## # … with 269 more rows
beta_spread %>%
top_n(10, log_ratio) %>%
rbind(beta_spread %>% top_n(-10, log_ratio)) %>%
arrange(-log_ratio) %>%
mutate(term = reorder(term, log_ratio)) %>%
ggplot(aes(term, log_ratio)) +
geom_col(show.legend = FALSE) +
coord_flip() +
theme(text = element_text(family = "黑體-繁 中黑"))
july_topics <- tidy(july_lda, matrix = "gamma") %>%
group_by(document) %>%
top_n(1, wt = gamma)
head(july_topics)
## # A tibble: 6 x 3
## # Groups: document [6]
## document topic gamma
## <chr> <int> <dbl>
## 1 http://www.facebook.com/109391162488374/posts/17784152555859… 1 0.998
## 2 http://www.facebook.com/109391162488374/posts/17794676688140… 1 0.996
## 3 http://www.facebook.com/109391162488374/posts/17812340386374… 1 0.996
## 4 http://www.facebook.com/109391162488374/posts/17819411885666… 1 0.999
## 5 http://www.facebook.com/109391162488374/posts/17859646314976… 1 0.998
## 6 http://www.facebook.com/109391162488374/posts/17877339813207… 1 0.525
tokens_tf_idf <- tokens %>%
bind_tf_idf(word, topicBy, count) %>%
arrange(desc(tf_idf))
## Warning in bind_tf_idf.data.frame(., word, topicBy, count): A value for tf_idf is negative:
## Input should have exactly one row per document-term combination.
tokens_tf_idf %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(topicBy) %>%
top_n(15) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = topicBy)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~topicBy, ncol = 6, scales = "free") +
coord_flip()
## Selecting by tf_idf
sentiment_tf_idf <- tokens_tf_idf %>%
inner_join(NTUSD_ch) %>%
mutate(sentiment=ifelse(sentiment %in% "positive", 1, -1),
weighted_sentiment=tf_idf*sentiment)
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
sentiment_tf_idf %>%
group_by(url) %>%
summarise(swt = sum(weighted_sentiment)) %>%
left_join(url_person) %>%
ungroup() %>%
group_by(topicBy) %>%
summarise(swt=sum(swt)) %>%
ggplot(aes(x=topicBy, y=swt))+
geom_col() +
theme(axis.text.x=element_text(angle=30))
## Joining, by = "url"
When sentiment weights are computed with TF-IDF, every candidate's aggregate score turns negative.
Per-post sentiment weighted by TF-IDF:
sentiment_tf_idf %>%
group_by(url, topicBy) %>%
summarise(ws=sum(weighted_sentiment)) %>%
arrange(desc(ws)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: url [10]
## url topicBy ws
## <fct> <fct> <dbl>
## 1 http://www.facebook.com/1863023523934803/posts/2207… 韓國瑜 0.00682
## 2 http://www.facebook.com/261813197541354/posts/80316… 侯友宜 0.00450
## 3 http://www.facebook.com/261813197541354/posts/79210… 侯友宜 0.00328
## 4 http://www.facebook.com/261813197541354/posts/80780… 侯友宜 0.00314
## 5 http://www.facebook.com/1863023523934803/posts/2233… 韓國瑜 0.00310
## 6 http://www.facebook.com/339483189800311/posts/52640… 高思博 A Po 0.00287
## 7 http://www.facebook.com/1912587195640543/posts/2172… 林義豐MarkLin… 0.00258
## 8 http://www.facebook.com/1863023523934803/posts/2228… 韓國瑜 0.00257
## 9 http://www.facebook.com/1380211668909443/posts/2044… 姚文智翻台北… 0.00234
## 10 http://www.facebook.com/136845026417486/posts/13844… 柯文哲 0.00223
july_dtm_tf_idf <- tokens %>%
cast_dtm(url, word, count, weighting = tm::weightTfIdf)
july_dtm_tf_idf
## <<DocumentTermMatrix (documents: 1085, terms: 13509)>>
## Non-/sparse entries: 53765/14603500
## Sparsity : 100%
## Maximal term length: 9
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
july_dtm_tf_idf <- removeSparseTerms(july_dtm_tf_idf, sparse=.99)
library(text2vec)
##
## Attaching package: 'text2vec'
## The following object is masked from 'package:topicmodels':
##
## perplexity
article_words <- lapply(as.character(july_sentences$sentence),
function(x) {
tokens <- segment(x, jieba_tokenizer)
tokens <- tokens[!str_detect(tokens, regex("[0-9a-zA-Z]"))]
tokens <- tokens[!(tokens %in% stopwords)]
return(tokens)
})
# token iterator over the segmented posts
tokens_vec <- itoken(article_words)
# vocabulary with term and document counts
vocab_vec <- create_vocabulary(tokens_vec, ngram=c(1,1))
tail(vocab_vec, 20)
## Number of docs: 655
## 0 stopwords: ...
## ngram_min = 1; ngram_max = 1
## Vocabulary:
## term term_count doc_count
## 1: 城市 150 96
## 2: 發展 156 97
## 3: 高雄 161 53
## 4: 市民 162 110
## 5: 月 163 91
## 6: 今天 169 141
## 7: 未來 172 127
## 8: 年 175 102
## 9: 台中 178 90
## 10: 陳其邁 179 46
## 11: 一起 189 136
## 12: 文化 189 87
## 13: 新 189 95
## 14: 篇 207 10
## 15: 中 224 146
## 16: 台灣 226 123
## 17: 話 237 13
## 18: 思博新政 248 10
## 19: 市長 338 198
## 20: 桃園 402 125
# it <- itoken(tokens, ids = movie_review$id[1:4000], progressbar = FALSE)
it <- itoken(article_words)
v <- create_vocabulary(it)
v <- prune_vocabulary(v, term_count_min = 10, doc_proportion_max = 0.2)
vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(it, vectorizer, type = "dgTMatrix")
lda_model <- LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(x = dtm, n_iter = 1000,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
## INFO [2020-01-20 09:18:51] iter 25 loglikelihood = -165219.461
## INFO [2020-01-20 09:18:51] iter 50 loglikelihood = -161820.825
## INFO [2020-01-20 09:18:51] iter 75 loglikelihood = -160333.801
## INFO [2020-01-20 09:18:51] iter 100 loglikelihood = -159143.744
## INFO [2020-01-20 09:18:51] iter 125 loglikelihood = -158659.131
## INFO [2020-01-20 09:18:51] iter 150 loglikelihood = -158358.275
## INFO [2020-01-20 09:18:51] iter 175 loglikelihood = -158234.676
## INFO [2020-01-20 09:18:51] early stopping at 175 iteration
barplot(doc_topic_distr[1, ], xlab = "topic",
ylab = "proportion", ylim = c(0, 1),
names.arg = 1:ncol(doc_topic_distr))
Top 10 words at lambda = 1 (topics 1, 5, and 10 shown):
lda_model$get_top_words(n = 10, topic_number = c(1L, 5L, 10L), lambda = 1)
## [,1] [,2] [,3]
## [1,] "陳其邁" "服務" "台北"
## [2,] "高雄" "空間" "丁守中"
## [3,] "台南" "捷運" "柯文哲"
## [4,] "議員" "更新" "經濟"
## [5,] "高雄市" "提供" "新"
## [6,] "參選人" "蘇貞昌" "支持"
## [7,] "感謝" "團隊" "未來"
## [8,] "加入" "推動" "拚"
## [9,] "陳" "市府" "計畫"
## [10,] "黃" "現在" "政策"
lda_model$get_top_words(n = 10, topic_number = c(1L, 5L, 10L), lambda = 0.2)
## [,1] [,2] [,3]
## [1,] "陳其邁" "蘇貞昌" "丁守中"
## [2,] "高雄" "里長" "柯文哲"
## [3,] "議員" "更新" "拚"
## [4,] "高雄市" "興建" "台北"
## [5,] "黃" "金山" "住宅"
## [6,] "立即" "機捷" "驕傲"
## [7,] "陳" "衝衝衝" "產值"
## [8,] "表示" "花博" "心"
## [9,] "參選人" "營運" "高中"
## [10,] "立委" "活動中心" "建立"
lda_model$plot()
## Loading required namespace: servr
vectorizer_vec <- vocab_vectorizer(vocab_vec)
# tcm = term co-occurrence matrix
tcm_vec <- create_tcm(tokens_vec, vectorizer_vec)
Fit GloVe word vectors on the term co-occurrence matrix (TCM) computed above.
glove_model <- GlobalVectors$new(word_vectors_size = 50, vocabulary = create_vocabulary(tokens_vec), x_max = 10)
glove <- glove_model$fit_transform(tcm_vec, n_iter = 20)
## INFO [2020-01-20 09:18:53] 2020-01-20 09:18:53 - epoch 1, expected cost 0.0652
## INFO [2020-01-20 09:18:54] 2020-01-20 09:18:54 - epoch 2, expected cost 0.0379
## INFO [2020-01-20 09:18:54] 2020-01-20 09:18:54 - epoch 3, expected cost 0.0294
## INFO [2020-01-20 09:18:55] 2020-01-20 09:18:55 - epoch 4, expected cost 0.0243
## INFO [2020-01-20 09:18:55] 2020-01-20 09:18:55 - epoch 5, expected cost 0.0208
## INFO [2020-01-20 09:18:55] 2020-01-20 09:18:55 - epoch 6, expected cost 0.0181
## INFO [2020-01-20 09:18:56] 2020-01-20 09:18:56 - epoch 7, expected cost 0.0160
## INFO [2020-01-20 09:18:56] 2020-01-20 09:18:56 - epoch 8, expected cost 0.0144
## INFO [2020-01-20 09:18:57] 2020-01-20 09:18:57 - epoch 9, expected cost 0.0130
## INFO [2020-01-20 09:18:57] 2020-01-20 09:18:57 - epoch 10, expected cost 0.0119
## INFO [2020-01-20 09:18:57] 2020-01-20 09:18:57 - epoch 11, expected cost 0.0109
## INFO [2020-01-20 09:18:58] 2020-01-20 09:18:58 - epoch 12, expected cost 0.0101
## INFO [2020-01-20 09:18:58] 2020-01-20 09:18:58 - epoch 13, expected cost 0.0094
## INFO [2020-01-20 09:18:59] 2020-01-20 09:18:59 - epoch 14, expected cost 0.0087
## INFO [2020-01-20 09:18:59] 2020-01-20 09:18:59 - epoch 15, expected cost 0.0082
## INFO [2020-01-20 09:19:00] 2020-01-20 09:18:59 - epoch 16, expected cost 0.0077
## INFO [2020-01-20 09:19:00] 2020-01-20 09:19:00 - epoch 17, expected cost 0.0072
## INFO [2020-01-20 09:19:00] 2020-01-20 09:19:00 - epoch 18, expected cost 0.0069
## INFO [2020-01-20 09:19:01] 2020-01-20 09:19:01 - epoch 19, expected cost 0.0065
## INFO [2020-01-20 09:19:01] 2020-01-20 09:19:01 - epoch 20, expected cost 0.0062
## PCA to inspect word locations
pca = prcomp(glove, scale. = T, center = T)
# rownames(pca$x) = rownames(word.vec)
i = grep('市長|', rownames(pca$x))
j = grep('台北|架子|勤奮',rownames(pca$x)[i])
plot(pca$x[i[-j] ,2], pca$x[i[-j] ,3], type = 'b', lty = 2, pch = 'x', col = 'red', xlim = c(0.8,-2), ylim = c(4,-1))
text(pca$x[i[-j] ,2], pca$x[i[-j] ,3], labels = rownames(pca$x)[i[-j]], pos = 4)
# i = grep('姚文智|丁能兒', rownames(pca$x))
# j = grep('市長|柯文哲|丁守中|姚文智',rownames(pca$x)[i])
# points(pca$x[i[-j] ,2], pca$x[i[-j] ,3], type = 'b', lty = 2, pch = 'x', col = 'red')
# text(pca$x[i[-j] ,2], pca$x[i[-j] ,3], labels = rownames(pca$x)[i[-j]], pos = 4)
k <- kmeans(glove, 300, iter.max = 100)
unique(k$cluster)
## [1] 242 178 163 224 57 107 56 273 9 284 1 283 20 276 52 205 112
## [18] 48 233 124 279 30 12 32 169 28 75 139 260 170 127 294 116 16
## [35] 209 159 286 146 73 221 143 34 114 263 19 42 14 10 97 131 231
## [52] 189 21 85 281 44 206 121 63 54 200 37 89 74 228 156 123 289
## [69] 187 270 196 148 5 25 59 236 295 49 157 96 202 18 293 93 110
## [86] 4 190 181 82 174 100 158 70 203 39 91 299 80 185 204 193 128
## [103] 162 43 234 198 66 101 117 183 250 151 251 144 8 232 6 291 72
## [120] 226 77 225 256 300 130 90 168 223 13 239 133 227 113 220 199 212
## [137] 188 45 230 208 177 173 87 255 207 50 109 160 84 216 23 103 182
## [154] 118 105 180 24 94 115 248 152 218 240 38 217 92 269 53 186 102
## [171] 257 88 136 237 134 252 147 265 81 126 55 261 229 172 132 104 60
## [188] 122 266 215 141 62 176 15 241 76 282 33 35 288 129 175 71 238
## [205] 64 272 27 7 297 296 155 69 264 287 79 262 137 267 86 145 197
## [222] 271 140 65 61 119 153 138 278 26 235 211 161 31 51 275 2 222
## [239] 149 166 201 245 58 95 285 36 135 244 67 40 171 17 194 83 195
## [256] 165 111 254 213 142 120 179 98 280 247 22 29 210 290 191 274 249
## [273] 47 154 268 219 243 259 68 150 106 167 46 125 108 99 192 214 184
## [290] 298 292 78 164 41 258 277 11 253 3 246
https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/
library(stm)
## stm v1.3.5 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(stminsights)
library(quanteda)
## Package version: 1.5.2
## Parallel computing: 2 of 24 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
##
## View
# dfm2stm <- convert(july_dtm_tf_idf, to = "stm")
# model.stm <- stm(
# dfm2stm$documents,
# dfm2stm$vocab,
# K = topic.count,
# data = dfm2stm$meta,
# init.type = "Spectral"
# )
In the post https://www.facebook.com/339483189800311/posts/532367277178567, 高思博's name is counted 37 times. 高思博 often appends many hashtags to his posts or compiles serialized posts such as 思博新政:
【思博新政-第一話】治安篇 https://tinyurl.com/apokao1
【思博新政-第二話】水資源篇 https://tinyurl.com/apokaowater
These multi-episode titles make 思博 and 新政 appear very frequently. > 「思博新政」 has since been added as a dictionary term.
To do: track how each candidate's issue emphasis changes from July to November (a sketch of the multi-month pipeline follows).
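One possible way to extend the July pipeline across the later months is sketched below. The file names data/08.xlsx through data/11.xlsx and their layout are assumptions (only 07.xlsx is read above), chi_tokenizer and issue_keyword defined earlier are reused, and the emoji/encoding cleaning is omitted for brevity:

# Sketch: repeat the issue-category counts for each month from July to November.
months <- c("07", "08", "09", "10", "11")

issue_by_month <- lapply(months, function(m) {
  dat <- read.xlsx(file.path("data", paste0(m, ".xlsx")), sheetIndex = 1, encoding = "UTF-8")
  colnames(dat)[1:5] <- c("note", "dpm", "day", "mid", "id")
  dat %>%
    filter(is.na(note)) %>%                       # keep politicians' own posts
    select(url, content, topicBy) %>%
    mutate(content = as.character(content)) %>%
    unnest_tokens(word, content, token = chi_tokenizer) %>%
    inner_join(issue_keyword, by = "word") %>%    # attach issue categories
    count(topicBy, category) %>%
    mutate(month = m)
}) %>%
  bind_rows()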
According to an MIS master's thesis,《應用情感分析於輿情之研究-以台灣2016總統選舉為例》(2014), the NTUSD dictionary lacks many political-domain and netizen-comment terms such as 賣台, 挺柱, and 下台, so the seed lexicon needs to be expanded.
We also need to add ChineseWordnet and SentiWordNet, and normalize the sentiment scores (Z-score); see https://liferay.de.dariah.eu/tatom/feature_selection.html. A minimal normalization sketch follows.
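As a first step toward that normalization, a minimal sketch that z-scores the per-post NTUSD sentiment totals already computed above (sentiment_weight):

# z-score: subtract the mean and divide by the standard deviation of swt
sentiment_weight %>%
  mutate(swt_z = (swt - mean(swt)) / sd(swt))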