PTT Gossiping board: after the Nantou County Government demanded that online forums delete specific posts, netizens debated whether the government was restricting freedom of speech. This post analyzes that discussion.
In 2020, a sexual assault case at Xuguang Senior High School (旭光高中) in Nantou County shocked Taiwanese society. The perpetrator's father, a police sergeant serving as deputy chief of a local substation, allegedly tried to use his position to cover for his son, drawing public outrage. We therefore use text mining and sentiment analysis to explore how the broader online community viewed the case.
Focusing on the recent TSJ case, we analyze the related discussion among PTT users along three main lines:
1. When did discussion of the TSJ case first appear, and when did it peak?
2. What do the positive and the negative discussions contain, and do they differ over time?
3. What are the approximate sentiment scores of the positive and negative discussions?
System settings
Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # avoid garbled Chinese characters
## [1] ""
Install the required packages
packages = c("dplyr", "tidytext", "stringr", "wordcloud2", "ggplot2",'readr','data.table','reshape2','wordcloud','tidyr','scales')
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
Load the libraries
rm(list=ls(all=T))
library(dplyr)
library(stringr)
library(tidytext)
library(wordcloud2)
library(data.table)
library(ggplot2)
library(reshape2)
library(wordcloud)
library(tidyr)
library(readr)
library(scales)
require(jiebaR)
library(janeaustenr)
library(ngram)
require(widyr)
require(readr)
require(NLP)
require(ggraph)
require(igraph)
Read in the articles and comments
MetaData = fread('csv/tsj_a.csv',encoding = 'UTF-8')
Reviews = fread('csv/tsj_r.csv',encoding = 'UTF-8')
# Filter the articles again by keyword: from 488 articles down to 201
keywords = c('TSJ','口交','惡徒','狂魔','吹狂魔','田勝傑','田聖傑','田裕璋','tsj','旭光')
toMatch = paste(keywords,collapse="|")
MetaData = with(MetaData, MetaData[grepl(toMatch,sentence)|grepl(toMatch,artTitle),])
# Keep only the comments that belong to the filtered articles: from 40,058 comments down to 18,372
Reviews = left_join(MetaData, Reviews[,c("artUrl", "cmtContent")], by = "artUrl")
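As an optional sanity check (the expected counts come from the comments above), we can confirm the sizes after filtering and joining:
nrow(MetaData) # expect 201 articles after the keyword filter
nrow(Reviews)  # expect roughly 18,372 rows (articles with no matching comment contribute NA rows)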
(1). Article tokenization
Set up the tokenization engine
# Load the user-defined dictionary and stop word list
jieba_tokenizer <- worker(user="dict/user_dict.txt", stop_word = "dict/stop_words.txt")
# Define the tokenization function
customized_tokenizer <- function(t) {
lapply(t, function(x) {
tokens <- segment(x, jieba_tokenizer)
return(tokens)
})
}
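A quick illustration of what the tokenizer returns, using a made-up sentence (the exact segmentation depends on the contents of dict/user_dict.txt and dict/stop_words.txt):
customized_tokenizer("副所長包庇兒子引起公憤")
# returns a list with one character vector per input string,
# e.g. list(c("副所長", "包庇", "兒子", "引起", "公憤"))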
# Tokenize the articles and the comments
MToken <- MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
RToken <- Reviews %>% unnest_tokens(word, cmtContent, token=customized_tokenizer)
# Bind the two token tables together
data <- rbind(MToken[,c("artDate","artUrl", "word")],RToken[,c("artDate","artUrl", "word")])
(2). Basic data cleaning
# Parse the date column
data$artDate= data$artDate %>% as.Date("%Y/%m/%d")
# Keep the token "tsj" itself, since the filters below would drop it
data_tsj = data %>% filter(word == "tsj")
# Filter out noise tokens
data_select = data %>%
filter(!grepl('[[:punct:]]',word)) %>% # drop punctuation
filter(!grepl("[0-9a-zA-Z]",word)) %>% # drop tokens containing digits or English letters
filter(nchar(word)>1) # drop single-character tokens
data_select = bind_rows(data_tsj,data_select) # add the "tsj" tokens back in
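To see what these filters keep and drop, here is a minimal check on a few made-up tokens (illustration only, not part of the pipeline):
sample_tokens <- c("tsj", "勇者", "abc123", "!", "好")
sample_tokens[!grepl('[[:punct:]]', sample_tokens) &
              !grepl("[0-9a-zA-Z]", sample_tokens) &
              nchar(sample_tokens) > 1]
# [1] "勇者" -- note "tsj" is dropped too, which is why it was set aside and re-added above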
# Count each word's daily frequency
# word_count: artDate, word, count
word_count <- data_select %>%
select(artDate,word) %>%
group_by(artDate,word) %>%
summarise(count=n()) %>% # summarise gives per-day, per-word counts
ungroup() %>% # drop the residual grouping left by summarise
filter(count>3) %>% # drop words that appear too rarely
arrange(desc(count))
(3). Daily post counts
# Count posts per day
date = data %>% select(artDate, artUrl) %>% distinct()
date = date %>% group_by(artDate) %>% summarize(count_day = n()) %>% ungroup()
date = date %>% arrange(desc(count_day))
date_plot <- date %>%
ggplot(aes(x = artDate, y = count_day)) +
geom_line(color = "purple", size = 1.5) +
geom_vline(xintercept = c(as.numeric(as.Date("2021-03-23")),
as.numeric(as.Date("2020-10-23")),
as.numeric(as.Date("2020-09-10"))
), col='red', size = 1) +
scale_x_date(labels = date_format("%Y/%m/%d")) +
ggtitle("「TSJ」討論文章數") +
xlab("日期") +
ylab("數量")
date_plot
The plot reveals three notable spikes in posting volume.
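As a quick optional check (the `date` table built above is already sorted by daily post count), the three peak dates can be read off directly:
head(date, 3) # the top three days by number of posts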
Load the dictionary files; within each file the words are separated by commas (“,”).
P <- read_file("dict/liwc/positive.txt") # positive word dictionary (txt)
N <- read_file("dict/liwc/negative.txt") # negative word dictionary (txt)
# each dictionary file is read in as a single string
typeof(P)
## [1] "character"
Split the strings into individual words and combine the two sentiment dictionaries
# Split each string on the commas
# strsplit returns a list; we take its first element
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]
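A toy example of why the [[1]] is needed: strsplit returns a list with one element per input string.
strsplit("生氣,難過,憤怒", ",")[[1]]
# [1] "生氣" "難過" "憤怒"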
# Build data frames with two columns, word and sentiment; the word column holds the dictionary vector
P = data.frame(word = P, sentiment = "positive") # 661 words
N = data.frame(word = N, sentiment = "negative") # 1049 words
# Combine the two dictionaries
LIWC = rbind(P, N)
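A quick check of the combined dictionary (the word counts come from the comments above) should print:
table(LIWC$sentiment)
## negative positive
##     1049      661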
Examine the most frequent words on the three days with the most posts
g_tokens_by_date <- data_select %>% count(artDate, word, sort = TRUE)
plot_merge <- g_tokens_by_date %>%
filter(artDate == as.Date("2021-03-23") |
artDate == as.Date("2020-10-23") |
artDate == as.Date("2020-09-10")) %>%
group_by(artDate) %>%
top_n(7, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x=word, y=n, fill = artDate)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = NULL) +
facet_wrap(~artDate, scales="free", ncol = 2) +
coord_flip()
plot_merge
2020/09/10: The news headline that day read 「警察兒性侵女同學,警駁吃案遭蓋樓罵爆!緊急刪文分局長道歉」. The perpetrator tsj's father was deputy chief of a police substation under Nantou County's Caotun Precinct, so when news broke that the case had allegedly been buried, netizens voiced their anger with the classic line 「好大的官威」 ("what impressive official authority").
2020/10/23: After state authority intervened from outside, many 「自殺文」 ("suicide posts") appeared that day: posts that deliberately mentioned the sensitive keywords to defy the authorities. PTT users replied 「勇者」 ("brave one") to praise the posters' defiance. The word 「有聲音」 ("I can hear it") spiked because a netizen made a SpongeBob meme image, and commenters replied 「有畫面又有聲音」 ("I can see it and hear it") in approval.
2021/03/23: Because both parties were minors, the boy's real name was never officially released. The county government repeatedly demanded that PTT, Dcard, Bahamut, and other platforms delete the posts, and fined Bahamut NT$60,000 under the 《兒少法》 (Protection of Children and Youths Welfare and Rights Act). In response, internet celebrity 「小商人」 published the boy's name on Facebook on March 23.
To see which words typically appear near the key names, we slide an 11-token window over each article (a center token plus five tokens on either side):
ngram_11 <- function(t) {
lapply(t, function(x) {
tokens <- segment(x, jieba_tokenizer) # tokenize the text
ngram <- ngrams(tokens, 11) # all 11-grams (NLP::ngrams)
ngram <- lapply(ngram, paste, collapse = " ")
unlist(ngram)
})
}
# Tokenize the articles again, keeping "tsj" but dropping any other token containing digits or letters
data_tsj <- MetaData %>%
select(artUrl, sentence) %>%
unnest_tokens(word, sentence, token = customized_tokenizer)
data_tsj = data_tsj %>% filter(word == "tsj" | !str_detect(word, regex("[0-9a-zA-Z]")))
# Paste the remaining tokens back into one string per article
w = data_tsj %>% group_by(artUrl) %>% summarize(sentence = paste(word,collapse = ""))
g_ngram_11 <- w %>%
select(artUrl, sentence) %>%
unnest_tokens(ngram, sentence, token = ngram_11)
g_ngrams_11_separated <- g_ngram_11 %>%
separate(ngram, paste0("word", 1:11), sep = " ")
1. Words commonly appearing near 「田勝傑」
g_check_words <- g_ngrams_11_separated %>%
filter(word6 == "田勝傑")
g_check_words_count <- g_check_words %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", 1:11)) %>%
rename(word=value) %>%
filter(variable!="word6") %>% # drop the center word itself
filter(!(word %in% stop_words$word), nchar(word)>1) %>% # stop_words is tidytext's English stop word list
count(word, sort = TRUE)
g_check_words_count %>%
arrange(desc(abs(n))) %>%
head(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = n > 0)) +
geom_col(show.legend = FALSE) +
xlab("出現在「田勝傑」附近的字") +
ylab("出現次數") +
coord_flip()
2. Words commonly appearing near 「tsj」
g_check_words <- g_ngrams_11_separated %>%
filter(word6 == "tsj")
g_check_words_count <- g_check_words %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", 1:11)) %>%
rename(word=value) %>%
filter(variable!="word6") %>%
filter(!(word %in% stop_words$word), nchar(word)>1) %>%
count(word, sort = TRUE)
g_check_words_count %>%
arrange(desc(abs(n))) %>%
head(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = n > 0)) +
geom_col(show.legend = FALSE) +
xlab("出現在「tsj」附近的字") +
ylab("出現次數") +
coord_flip()
Word co-occurrence and correlation across articles
g_words_by_art <- data_select %>%
count(artUrl, word, sort = TRUE)
g_word_pairs <- g_words_by_art %>%
pairwise_count(word, artUrl, sort = TRUE)
g_word_cors <- g_words_by_art %>%
group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, artUrl, sort = TRUE)
threshold <- 0.65 # tuned by hand
remove_words <- g_word_cors %>%
filter(correlation>threshold) %>%
.$item1 %>%
unique()
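As a side note (not part of the original analysis), plotting the distribution of pairwise correlations is one way to sanity-check the hand-tuned threshold:
g_word_cors %>%
ggplot(aes(x = correlation)) +
geom_histogram(bins = 50) # most pairs sit near zero; the cutoff should fall in the thin right tail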
set.seed(2017)
g_word_cors_new <- g_word_cors %>%
filter(!(item1 %in% remove_words|item2 %in% remove_words))
g_word_cors_new %>%
filter(correlation > .505) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 3) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
Compute each day's sentiment totals (sentiment_count)
# sentiment_count: artDate, sentiment, count
sentiment_count = data_select %>%
select(artDate,word) %>%
inner_join(LIWC) %>%
group_by(artDate,sentiment) %>%
summarise(count=n())
# Check the date range of the data
range(sentiment_count$artDate)
## [1] "2020-09-09" "2021-03-24"
sentiment_count %>%
ggplot()+
geom_line(aes(x=artDate,y=count,colour=sentiment))+
scale_x_date(labels = date_format("%Y/%m/%d"),
limits = as.Date(c('2020-09-08','2021-03-25'))
)+
# Mark the three key dates
geom_vline(xintercept = as.numeric(as.Date('2020-09-10')), colour = "green", linetype=4) +
geom_vline(xintercept = as.numeric(as.Date('2020-10-23')), colour = "green", linetype=4) +
geom_vline(xintercept = as.numeric(as.Date('2021-03-23')), colour = "green", linetype=4)
sentiment_count %>%
# Normalize: convert each day's counts into shares of that day's total (e.g. 30 positive out of 100 sentiment words gives 0.3)
group_by(artDate) %>%
mutate(ratio = count/sum(count)) %>%
# Plot
ggplot()+
geom_line(aes(x=artDate,y=ratio,colour=sentiment))+
scale_x_date(labels = date_format("%m/%d"),
limits = as.Date(c('2020-09-08','2021-03-25'))
)
1. Sentiment in the five days around 2020/09/10
sentiment_count %>% filter(artDate<=as.Date("2020-09-17",format="%Y-%m-%d"))%>%
# Normalize within each day
group_by(artDate) %>%
mutate(ratio = count/sum(count)) %>%
# Plot
ggplot()+
geom_line(aes(x=artDate,y=ratio,colour=sentiment))+
scale_x_date(labels = date_format("%m/%d"),
limits = as.Date(c('2020-09-08','2020-09-17'))
)+
# Mark the key date
geom_vline(xintercept = as.numeric(as.Date('2020-09-10')), colour = "black", linetype=4)
2. Sentiment in the five days around 2020/10/23
sentiment_count %>% filter(artDate<=as.Date("2020-10-28",format="%Y-%m-%d")&artDate>=as.Date("2020-10-18",format="%Y-%m-%d"))%>%
# Normalize within each day
group_by(artDate) %>%
mutate(ratio = count/sum(count)) %>%
# Plot
ggplot()+
geom_line(aes(x=artDate,y=ratio,colour=sentiment))+
scale_x_date(labels = date_format("%m/%d"),
limits = as.Date(c('2020-10-18','2020-10-28'))
)+
geom_vline(xintercept = as.integer(as.Date("2020-10-23")), col = "black", linetype=4)
3. Sentiment in the five days around 2021/03/23
sentiment_count %>% filter(artDate<=as.Date("2021-03-25",format="%Y-%m-%d")&artDate>=as.Date("2021-03-18",format="%Y-%m-%d"))%>%
group_by(artDate) %>%
mutate(ratio = count/sum(count)) %>%
ggplot()+
geom_line(aes(x=artDate,y=ratio,colour=sentiment))+
scale_x_date(labels = date_format("%m/%d"),
limits = as.Date(c('2021-03-16','2021-03-25'))
)+
geom_vline(xintercept = as.integer(as.Date("2021-03-23")), col = "black", linetype=4)
1. Word cloud for 2020-09-10
# Draw the word cloud
word_count %>%
filter(artDate == as.Date('2020-09-10')) %>%
select(word,count) %>%
group_by(word) %>%
summarise(count = sum(count)) %>%
arrange(desc(count)) %>%
filter(count>20) %>% # drop words that appear too rarely
wordcloud2()
2. Representative positive and negative words for 2020-09-10
# sentiment_sum: word, sentiment, sum
sentiment_sum <-
word_count %>%
filter(artDate == as.Date('2020-09-10')) %>%
inner_join(LIWC) %>%
group_by(word,sentiment) %>%
summarise(
sum = sum(count)
) %>%
arrange(desc(sum)) %>%
data.frame()
sentiment_sum %>%
top_n(20,wt = sum) %>%
mutate(word = reorder(word, sum)) %>%
ggplot(aes(word, sum, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
theme(text=element_text(size=14))+
coord_flip()
3. Positive/negative comparison cloud for 2020-09-10
sentiment_sum %>%
acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
comparison.cloud(
colors = c("salmon", "#72bcd4"), # positive negative
max.words = 50)
1. Word cloud for 2020-10-23
# Draw the word cloud
word_count %>%
filter(artDate == as.Date('2020-10-23')) %>%
select(word,count) %>%
group_by(word) %>%
summarise(count = sum(count)) %>%
arrange(desc(count)) %>%
filter(count>30) %>% # drop words that appear too rarely
wordcloud2()
2. Representative positive and negative words for 2020-10-23
# sentiment_sum: word, sentiment, sum
sentiment_sum <-
word_count %>%
filter(artDate == as.Date('2020-10-23')) %>%
inner_join(LIWC) %>%
group_by(word,sentiment) %>%
summarise(
sum = sum(count)
) %>%
arrange(desc(sum)) %>%
data.frame()
sentiment_sum %>%
top_n(20,wt = sum) %>%
mutate(word = reorder(word, sum)) %>%
ggplot(aes(word, sum, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
theme(text=element_text(size=14))+
coord_flip()
3. Positive/negative comparison cloud for 2020-10-23
sentiment_sum %>%
acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
comparison.cloud(
colors = c("salmon", "#72bcd4"), # positive negative
max.words = 50)
1. Word cloud for 2021-03-23
# Draw the word cloud
word_count %>%
filter(artDate == as.Date('2021-03-23')) %>%
select(word,count) %>%
group_by(word) %>%
summarise(count = sum(count)) %>%
arrange(desc(count)) %>%
filter(count>15) %>% # drop words that appear too rarely
wordcloud2()
2. Representative positive and negative words for 2021-03-23
# sentiment_sum: word, sentiment, sum
sentiment_sum <-
word_count %>%
filter(artDate == as.Date('2021-03-23')) %>%
inner_join(LIWC) %>%
group_by(word,sentiment) %>%
summarise(
sum = sum(count)
) %>%
arrange(desc(sum)) %>%
data.frame()
sentiment_sum %>%
top_n(30,wt = sum) %>%
mutate(word = reorder(word, sum)) %>%
ggplot(aes(word, sum, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
theme(text=element_text(size=14))+
coord_flip()
3. Positive/negative comparison cloud for 2021-03-23
sentiment_sum %>%
acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
comparison.cloud(
colors = c("salmon", "#72bcd4"), # positive negative
max.words = 50)
So far the sentiment analysis has summed over all words. Next we separate positive and negative articles to see whether anything new emerges. We score each article's positive and negative word counts and classify it accordingly: for example, an article with more positive than negative dictionary hits is labeled positive, and all other articles (including ties) are labeled negative.
# Classify each article by the balance of its positive and negative word counts
article_type =
data_select %>%
inner_join(LIWC) %>%
group_by(artUrl,sentiment) %>%
summarise(count=n()) %>%
spread(sentiment,count,fill = 0) %>% # spread sentiments into columns, filling missing values with 0
mutate(type = case_when(positive > negative ~ "positive",
TRUE ~ "negative")) %>%
data.frame()
# Count how many articles fall into each class
article_type %>%
group_by(type) %>%
summarise(count = n())
## # A tibble: 2 x 2
## type count
## * <chr> <int>
## 1 negative 148
## 2 positive 49
# Attach each article's date
article_type_date = left_join(article_type[,c("artUrl", "type")], MetaData[,c("artUrl", "artDate")], by = "artUrl")
article_type_date$artDate = as.Date(article_type_date$artDate,format="%Y/%m/%d")
article_type_date %>%
group_by(artDate,type) %>%
summarise(count = n()) %>%
ggplot(aes(x = artDate, y = count, fill = type)) +
geom_bar(stat = "identity", position = "dodge")+
scale_x_date(labels = date_format("%m/%d"),
limits = as.Date(c('2020-09-01','2021-03-25'))
)
Pick out the positive and the negative articles and merge them with the tokenized words.
# negative_article: artUrl, word
negative_article <-
article_type %>%
filter(type=="negative")%>%
select(artUrl) %>%
left_join(data_select[,c("artUrl", "word")], by = "artUrl")
# positive_article: artUrl, word
positive_article <-
article_type %>%
filter(type=="positive")%>%
select(artUrl) %>%
left_join(data_select[,c("artUrl", "word")], by = "artUrl")
Plot the keywords that contribute the most sentiment in the negative and positive articles
# Contribution plot for negative articles
negative_article %>%
inner_join(LIWC) %>%
group_by(word,sentiment) %>%
summarise(
sum = n()
)%>%
arrange(desc(sum)) %>%
data.frame() %>%
top_n(30,wt = sum) %>%
ungroup() %>%
mutate(word = reorder(word, sum)) %>%
ggplot(aes(word, sum, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to negative sentiment",
x = NULL) +
theme(text=element_text(size=14))+
coord_flip()
# Contribution plot for positive articles
positive_article %>%
inner_join(LIWC) %>%
group_by(word,sentiment) %>%
summarise(
sum = n()
)%>%
arrange(desc(sum)) %>%
data.frame() %>%
top_n(30,wt = sum) %>%
ungroup() %>%
mutate(word = reorder(word, sum)) %>%
ggplot(aes(word, sum, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to positive sentiment",
x = NULL) +
theme(text=element_text(size=14))+
coord_flip()
Looking at the event as a whole, netizens' attitude was consistently critical and sarcastic. The article-count analysis shows that the assault itself drew comparatively little discussion and exposure. The reaction to the government was far stronger: discussion surged when the Nantou County Government formally ordered platforms to delete the posts, and peaked after the government issued further deletion demands and fined a platform that refused to comply. In other words, the harder the authorities tried to suppress the topic, the more it was discussed. The later wave of 「自殺文」 ("suicide posts") and the flood of 「勇者」 ("brave one") replies further demonstrate netizens' defiance of heavy-handed authority.