系統參數設定

Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # 避免中文亂碼
## [1] ""

安裝需要的packages

# echo = T,results = 'hide'
packages = c("dplyr", "tidytext", "stringr", "wordcloud2", "ggplot2",'readr','data.table','reshape2','wordcloud','tidyr','scales')
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)

讀進library

library(dplyr)
library(stringr)
library(tidytext)
library(wordcloud2)
library(data.table)
library(ggplot2)
library(reshape2)
library(wordcloud)
library(tidyr)
library(readr)
library(scales)
require(jiebaR)

資料基本介紹

  • 資料來源: 文字平台收集PTT Gossip版2021-01-01 ~ 2021-03-20 所有文章
  • 資料集: salmon_articleMetaData.csv
  • 關鍵字:翁立友、雞排妹、鄭家純
  • 資料時間:2021-01-29 ~ 2021-02-08

根據2021年初雞排妹控訴翁立友性騷擾事件做分析,主要分析ptt上網友的相關討論。本次主要針對以下方向分析:

1.討論大概出現在哪個時間點,話題高峰在哪裡?
2.正面和負面的討論內容各是甚麼,有沒有時間點上的差異?
3.正面和負面討論的情緒分數大約多少?
4.翁立友與雞排妹兩人的討論度?

ptt有些文章有「翁立友、雞排妹、鄭家純」關鍵字。

# 把文章和留言讀進來
MetaData = fread('../hw_data/ptt_weng_articleMetaData.csv',encoding = 'UTF-8')
Reviews  = fread('../hw_data/ptt_weng_articleReviews.csv',encoding = 'UTF-8')

# 再篩一次文章  篇
keywords = c('翁立友','雞排妹','鄭家純')
toMatch = paste(keywords,collapse="|")
MetaData = with(MetaData, MetaData[grepl(toMatch,sentence)|grepl(toMatch,artTitle),])

# 挑選文章對應的留言
Reviews = left_join(MetaData, Reviews[,c("artUrl", "cmtContent")], by = "artUrl")

1. 資料前處理

(1). 文章斷詞

設定斷詞引擎

# 加入自定義的字典
jieba_tokenizer <- worker(user="../dict/user_dict.txt", stop_word = "C:/Users/user/Documents/M/sma/R/20210323/dict/stop_words.txt")

# 設定斷詞function
customized_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    return(tokens)
  })
}
# 把文章和留言的斷詞結果併在一起
MToken <- MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
RToken <- Reviews %>% unnest_tokens(word, cmtContent, token=customized_tokenizer)

# 把資料併在一起
data <- rbind(MToken[,c("artDate","artUrl", "word")],RToken[,c("artDate","artUrl", "word")]) 

(2). 資料基本清理

  • 日期格式化
  • 去除特殊字元、詞頻太低的字
# 格式化日期欄位
data$artDate= data$artDate %>% as.Date("%Y/%m/%d")

# 過濾特殊字元
data_select = data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1) 
  
# 算每天不同字的詞頻
# word_count:artDate,word,count
word_count <- data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
word_count
## # A tibble: 17,210 x 3
## # Groups:   artDate [35]
##    artDate    word   count
##    <date>     <chr>  <int>
##  1 2021-02-05 雞排妹  1962
##  2 2021-02-05 性騷擾  1282
##  3 2021-02-03 雞排妹  1124
##  4 2021-02-04 雞排妹   994
##  5 2021-02-03 性騷擾   980
##  6 2021-02-06 雞排妹   975
##  7 2021-02-03 飛機杯   946
##  8 2021-02-05 道歉     876
##  9 2021-02-05 覺得     865
## 10 2021-02-04 性騷擾   794
## # ... with 17,200 more rows

2. 準備LIWC字典

全名Linguistic Inquiry and Word Counts,由心理學家Pennebaker於2001出版 分為正向情緒與負向情緒

讀檔,字詞間以“,”將字分隔

P <- read_file("../dict/liwc/positive.txt") # 正向字典txt檔
N <- read_file("../dict/liwc/negative.txt") # 負向字典txt檔

#字典txt檔讀進來是一整個字串
typeof(P)
## [1] "character"

分割字詞,並將兩個情緒字典併在一起

# 將字串依,分割
# strsplit回傳list , 我們取出list中的第一個元素
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]

# 建立dataframe 有兩個欄位word,sentiments,word欄位內容是字典向量
P = data.frame(word = P, sentiment = "positive") #664
N = data.frame(word = N, sentiment = "negative") #1047

# 把兩個字典拼在一起
LIWC = rbind(P, N)

# 檢視字典
head(LIWC)
##       word sentiment
## 1     一流  positive
## 2 下定決心  positive
## 3 不拘小節  positive
## 4   不費力  positive
## 5     不錯  positive
## 6     主動  positive

3. 將文章和與LIWC情緒字典做join

在畫出情緒之前,先看看每天的發文情形,大約在3.15之後才有較多的討論。

正負情緒發文折線圖

MetaData$artDate= MetaData$artDate %>% as.Date("%Y/%m/%d")
MetaData %>%
  group_by(artDate) %>%
  summarise(count = n()) %>%
  ggplot()+
    geom_line(aes(x=artDate,y=count))+
    scale_x_date(labels = date_format("%m/%d"))

找出文集中,對於LIWC字典是positive和negative的字

算出每天情緒總和(sentiment_count)

# sentiment_count:artDate,sentiment,count
sentiment_count = data_select %>%
  select(artDate,word) %>%
  inner_join(LIWC) %>% 
  group_by(artDate,sentiment) %>%
  summarise(count=n())  
## Joining, by = "word"
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.

畫出每天的情緒總分數,可以看到大概在3/18後,短短的幾天內,情緒從正面為主轉為負面為主。約在20號之後討論度逐漸下降。

正負情緒分數折線圖

# 檢視資料的日期區間
range(sentiment_count$artDate) #"2021-03-03" "2021-03-21"
## [1] "2021-01-29" "2021-03-20"
sentiment_count %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-01-29','2021-02-12'))
               )
## Warning: Removed 46 row(s) containing missing values (geom_path).

  # 加上標示日期的線
  #+geom_vline(aes(xintercept = as.numeric(artDate[which(sentiment_count$artDate == as.Date('2021-03-18')) [1]])),colour = "red") 

將情緒分數標準化後再畫一次圖,可以發現雖然正負面情緒有波動,但大部分正負面情緒各半,約在3/18後負面情緒佔比較高。

正負情緒比例折線圖

sentiment_count %>% 
  # 標準化的部分
  group_by(artDate) %>%
  mutate(ratio = count/sum(count)) %>%
  # 畫圖的部分
  ggplot()+
  geom_line(aes(x=artDate,y=ratio,colour=sentiment))+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-01-29','2021-02-12'))
               )
## Warning: Removed 46 row(s) containing missing values (geom_path).

  # 加上標示日期的線
  #+geom_vline(aes(xintercept = as.numeric(artDate[which(sentiment_count$artDate == as.Date('2021-03-19')) [1]])),colour = "red")

我們挑出幾個情緒高點的日期 觀察每日情緒分數,約從16號開始議題被大量討論,19達到議題高峰,之後就慢慢下降。

# 查看每天的情緒分數排名
sentiment_count %>%
  select(count,artDate) %>%
  group_by(artDate) %>%
  summarise(sum = sum(count)) %>%
  arrange(desc(sum))
## # A tibble: 35 x 2
##    artDate      sum
##    <date>     <int>
##  1 2021-02-05 12455
##  2 2021-02-03  6891
##  3 2021-02-04  5999
##  4 2021-02-06  5198
##  5 2021-02-02  3423
##  6 2021-02-07  3177
##  7 2021-02-08  1313
##  8 2021-02-15   697
##  9 2021-02-17   649
## 10 2021-02-16   586
## # ... with 25 more rows

4. 畫出文字雲

挑出有興趣的日期,畫出文字雲看看都在討論甚麼主題。

先從2021-02-03的雞排妹開記者會,呼應上面負面的情緒分析,出現「性騷擾」、「噁心」等詞彙。但也出現「證據」、「覺得」、「應該」等詞彙,推測是因為沒有提出有力證據而大部分人抱著懷疑的態度。此外,在記者會上雞排妹在桌上擺放飛機杯也引發相當大的討論度,甚至比翁立友還大(不知道為什麼文字雲出不來QQ)

2021-02-03 文字雲

# 畫出文字雲

word_count %>%
  #filter(!(word %in% c("真的"))) %>%
  filter(artDate == as.Date('2021-02-03')) %>% 
  select(word,count) %>% 
  group_by(word) %>% 
  summarise(count = sum(count)) %>%
  arrange(desc(count)) %>%
  filter(count>50) %>%   # 過濾出現太少次的字
  wordcloud2()
## Adding missing grouping variables: `artDate`

看前後兩天的討論情況

2021-02-05的文字雲,為翁立友招開記者會,對比2月3號,由於翁立友記著會上也沒有提出有力證據證明自己沒有性騷擾,因此,「證據」、「覺得」、「應該」等詞彙還是出現,這場記者會為翁立友招開,但對雞排妹的討論度,翁立友甚至比雞排妹賣的飛機杯少,因此根本沒人在乎翁立友(幫QQ)。

2021-02-05 文字雲

# 畫出文字雲
word_count %>% 
  #filter(!(word %in% c("飛機"))) %>%
  filter(artDate == as.Date("2021-02-05")) %>% 
  select(word,count) %>% 
  group_by(word) %>% 
  summarise(count = sum(count)) %>%
  arrange(desc(count)) %>%
  filter(count>100) %>%   # 過濾出現太少次的字
  wordcloud2()
## Adding missing grouping variables: `artDate`
# plot_0317

2021-02-08 文字雲

在兩人各開過一次記者會之後,此事件成了羅生門,雙方各說各話,因此在不久之後兩人的事件就被愛莉莎莎VS蒼藍鴿給蓋過。 2月8號,兩人名字出現次數相當但對比前面明顯下降許多,

# 畫出文字雲
word_count %>% 
  #filter(!(word %in% c("鮭魚"))) %>%
  filter(artDate == as.Date('2021-02-08')) %>% 
  select(word,count) %>% 
  group_by(word) %>% 
  summarise(count = sum(count)) %>%
  arrange(desc(count)) %>%
  filter(count>10) %>%   # 過濾出現太少次的字
  wordcloud2()
## Adding missing grouping variables: `artDate`
# plot_0317

5.找出情緒字典代表字

算出所有字詞的詞頻(sentiment_sum),找出情緒代表字

正負情緒代表字

# sentiment_sum:word,sentiment,sum
sentiment_sum <- 
  word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 
## Joining, by = "word"
## `summarise()` has grouped output by 'word'. You can override using the `.groups` argument.
sentiment_sum %>%
  top_n(30,wt = sum) %>%
  #filter(!(word %in% c("性騷擾"))) %>%
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()

另外一種呈現方式

正負情緒文字雲

sentiment_sum %>%
  #filter(!(word %in% c("性騷擾"))) %>%
  acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
  comparison.cloud(
    colors = c("salmon", "#72bcd4"), # positive negative
                   max.words = 50)

另外,也可以依據不同日期觀察情緒代表字的變化

2021-02-03 正負情緒代表字

sentiment_sum_select_0203 <- 
word_count %>%
  filter(artDate == as.Date('2021-02-03')) %>% 
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 
## Joining, by = "word"
## `summarise()` has grouped output by 'word'. You can override using the `.groups` argument.
sentiment_sum_select_0203   %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  #filter(!(word %in% c("性騷擾"))) %>%
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment 0203",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()

2021-02-03 正負情緒文字雲

sentiment_sum_select_0203 %>%
  acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
  comparison.cloud(
    colors = c("salmon", "#72bcd4"), # positive negative
                   max.words = 50)

### 2021-02-05 正負情緒代表字

sentiment_sum_select_0205 <- 
word_count %>%
  filter(artDate == as.Date('2021-02-05')) %>% 
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 
## Joining, by = "word"
## `summarise()` has grouped output by 'word'. You can override using the `.groups` argument.
sentiment_sum_select_0205   %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment 0205",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()

2021-03-17 正負情緒文字雲

sentiment_sum_select_0205 %>%
  acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
  comparison.cloud(
    colors = c("salmon", "#72bcd4"), # positive negative
                   max.words = 50)

6.歸類正負面文章

之前的情緒分析大部分是全部的詞彙加總,接下來將正負面情緒的文章分開,看看能不能發現一些新的東西。接下來歸類文章,將每一篇文章正負面情緒的分數算出來,然後大概分類文章屬於正面還是負面。

# 依據情緒值的正負比例歸類文章
article_type = 
  data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開,缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame() 
## Joining, by = "word"
## `summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.
# 看一下正負比例的文章各有幾篇
article_type %>%
  group_by(type) %>%
  summarise(count = n())
## # A tibble: 2 x 2
##   type     count
## * <chr>    <int>
## 1 negative   950
## 2 positive   434

可以看到在2/5號記者會之後,負面文章明顯增加。

正負情緒文章數量統計圖

# 
article_type_date = left_join(article_type[,c("artUrl", "type")], MetaData[,c("artUrl", "artDate")], by = "artUrl")


article_type_date %>%
  group_by(artDate,type) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = artDate, y = count, fill = type)) + 
  geom_bar(stat = "identity", position = "dodge")+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-01-29','2021-02-09'))
               )
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
## Warning: Removed 37 rows containing missing values (geom_bar).

把正面和負面的文章挑出來,並和斷詞結果合併。

# negative_article:artUrl,word
negative_article <-
article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
positive_article <-
article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(data_select[,c("artUrl", "word")], by = "artUrl")

畫出正負面文章情緒貢獻度較高的關鍵字

情緒關鍵字:負面情緒文章

# 負面情緒關鍵字貢獻圖
negative_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative sentiment",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()
## Joining, by = "word"
## `summarise()` has grouped output by 'word'. You can override using the `.groups` argument.

情緒關鍵字:正面情緒文章

# 正面情緒關鍵字貢獻圖
positive_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to positive sentiment",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()
## Joining, by = "word"
## `summarise()` has grouped output by 'word'. You can override using the `.groups` argument.

從正負面情緒圖觀察發現,正面和負面的關鍵字沒有甚麼顯著的差異,負面情緒較高的文章比較常出現「性騷擾」、「噁心」、「騷擾」等比較著重在批評性騷擾行為的行為;正面情緒較高的文章出現較多的「笑死」、「支持」、「相信」等字詞,著重在討論鄉民各自支持的人。

7.加入其他資料來源比較

# 加入dcard資料作比較
Dcard  = fread('../hw_data/dcardd翁__articleMetaData.csv',encoding = 'UTF-8')
DToken <- Dcard %>% unnest_tokens(word, sentence, token=customized_tokenizer)
PTT_Token <- rbind(MToken[,c("artDate","artUrl", "word")],RToken[,c("artDate","artUrl", "word")]) 

News = fread('../hw_data/新聞_翁_articleMetaData.csv')
N_Token <- News %>% unnest_tokens(word, sentence, token=customized_tokenizer)

PTT_Token = PTT_Token %>% mutate(source = "ptt")
Dcard_Token = DToken %>% mutate(source = "dcard")
NEWS_Token = N_Token %>% mutate(source = "news")

# 把資料併在一起
data_combine = rbind(PTT_Token,Dcard_Token[,c("artDate","artUrl", "word","source")])
#data_combine = rbind(PTT_Token,NEWS_Token[,c("artDate","artUrl", "word","source")])

data_combine$artDate= data_combine$artDate %>% as.Date("%Y/%m/%d")
#data_combine$artDate= data_combine$artDate %>% as.Date("%Y/%m/%d")

ptt和dcard的情緒分布直方圖,可以發現dcard相較於ptt正面情緒稍多,話題討論高峰的時間點也大致相同。

ptt、dcard情緒分數比較

range(Dcard$artDate) #"2021/03/15" "2021/03/21"
## [1] "2021/01/30" "2021/03/13"
data_combine %>%
  inner_join(LIWC) %>%
  group_by(artDate,sentiment,source) %>%
  summarise(count = n()) %>%
  filter(artDate<='2021-02-03' || artDate >='2021-02-03') %>%
  
  # 畫圖的部分
  ggplot(aes(x= artDate,y=count,fill=sentiment)) +
  scale_color_manual() +
  geom_col(position="dodge") + 
  scale_x_date(labels = date_format("%m/%d")) +
  labs(title = "sentiment of ptt & dcard",color = "情緒類別") +
  facet_wrap(~source, ncol = 1, scales="free_y")  # scale可以調整比例尺
## Joining, by = "word"
## `summarise()` has grouped output by 'artDate', 'sentiment'. You can override using the `.groups` argument.

總結

最後總結一下之前提出的問題:

1.討論大概出現在哪個時間點,話題高峰在哪裡?

話題高峰大概在2/5有較熱烈的討論

2.正面和負面的討論內容各是甚麼,有沒有時間點上的差異?

主要圍繞在性騷擾如批評此行為噁心,或證據、覺得等希望雙方能提出有力證據而不是以感覺評斷 在2/5記者會之後,主要討論在鄉民們各自相信、支持哪一方

3.正面和負面討論的情緒分數哪個較高?

負面情緒一直高於正面,於2/5來到最高峰

4.翁立友與雞排妹兩人的討論度?

雞排妹討論度一直輾壓翁立友,甚至在2/3討論飛機杯的人比翁立友還多,沒人鳥翁立友(在幫QQ)

練習練習~

1.算出dcard網友的正負面情緒分數,並用折線圖呈現在同一張圖上 2.畫出dcard網友正負面代表字的文字雲,觀察dcard和ptt上對鮭魚之亂評論的差異

### Code Here ###

Homework

以讀書會為單位,針對有興趣的議題分析資料,作業轉成RPubs發布,並將連結上傳至網大「第五週HW」,每組一人上傳即可。

  • 資料來源:有興趣的議題即可,來源不限
  • 作業內容:針對有興趣的議題做出假設,並利用情緒分析相關的套件進行分析,如情緒分數、情緒代表字等。