第五周：對ptt, 聯合新聞網有關疫苗的貼文與新聞進行情緒分析

系統參數設定

Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # 避免中文亂碼

## [1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8"

安裝需要的packages

# echo = T,results = 'hide'
packages = c("dplyr", "tidytext", "stringr", "wordcloud2", "ggplot2",'readr','data.table','reshape2','wordcloud','tidyr','scales')
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)

讀進library

library(dplyr)
library(stringr)
library(tidytext)
library(wordcloud2)
library(data.table)
library(ggplot2)
library(reshape2)
library(wordcloud)
library(tidyr)
library(readr)
library(scales)
require(jiebaR)

資料基本介紹

資料來源: 文字平台收集PTT Gossip版2021-03-12 ~ 2021-03-26 所有文章
資料集： ptt_vaccine.csv
關鍵字：疫苗, AZ, 衛福部
資料時間：2021-03-12 ~ 2021-03-26

這次我們以最近 covid-19 疫苗開打以及疫苗調閱小組的爭議為分析主題，主要分析 PTT 八卦版的網友對於這兩個主要事件的相關討論，並且跟新聞媒體的報導做比較。
本次分析主要針對以下方向分析：

疫苗相關的討論主要聚集於哪幾天，話題高峰與討論主題為何？
正面與負面的討論內容為何？
PTT 貼文與媒體報導的情緒強度是否有差，我們假設 PTT 鄉民的會使用比較多的情緒字

在 0312 - 0326 這段時間中，各個資料來源的文章數如下：
- PTT : 322
- 聯合新聞 : 346
- 東森新聞 : 42
- 蘋果新聞 : 77

# 把文章和留言讀進來
MetaData = fread('../group_week_5/ptt_vaccine_articleMetaData.csv',encoding = 'UTF-8')
Reviews  = fread('../group_week_5/ptt_vaccine_articleReviews.csv',encoding = 'UTF-8')

# 再篩一次文章 333 篇
keywords = c('疫苗','AZ','衛福部')
toMatch = paste(keywords,collapse="|")
pttData = with(MetaData, MetaData[grepl(toMatch,sentence)|grepl(toMatch,artTitle),])

# 挑選文章對應的留言
Reviews = left_join(MetaData, Reviews[,c("artUrl", "cmtContent")], by = "artUrl")

1. 資料前處理

(1). 文章斷詞

設定斷詞引擎

# 加入自定義的字典
jieba_tokenizer <- worker(user="../group_week_5/user_dict.txt", stop_word = "../group_week_5/stop_words.txt")

# 設定斷詞function
customized_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    return(tokens)
  })
}

# 把文章和留言的斷詞結果併在一起
MToken <- MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
RToken <- Reviews %>% unnest_tokens(word, cmtContent, token=customized_tokenizer)

# 把資料併在一起
data <- rbind(MToken[,c("artDate","artUrl", "word")],RToken[,c("artDate","artUrl", "word")])

(2). 資料基本清理

日期格式化
去除特殊字元、詞頻太低的字

# 格式化日期欄位
data$artDate= data$artDate %>% as.Date("%Y/%m/%d")

# 過濾特殊字元
data_select = data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1) 
  
# 算每天不同字的詞頻
# word_count:artDate,word,count
word_count <- data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))

## `summarise()` regrouping output by 'artDate' (override with `.groups` argument)

word_count

## # A tibble: 4,510 x 3
## # Groups:   artDate [15]
##    artDate    word  count
##    <date>     <chr> <int>
##  1 2021-03-22 疫苗    444
##  2 2021-03-24 疫苗    383
##  3 2021-03-16 疫苗    356
##  4 2021-03-12 疫苗    260
##  5 2021-03-25 疫苗    255
##  6 2021-03-19 疫苗    239
##  7 2021-03-18 監督    236
##  8 2021-03-17 疫苗    217
##  9 2021-03-13 疫苗    212
## 10 2021-03-14 疫苗    200
## # … with 4,500 more rows

2. 準備LIWC字典

全名Linguistic Inquiry and Word Counts，由心理學家Pennebaker於2001出版分為正向情緒與負向情緒

讀檔，字詞間以“,”將字分隔

P <- read_file("../week5_tutorial/positive.txt") # 正向字典txt檔
N <- read_file("../week5_tutorial/negative.txt") # 負向字典txt檔

#字典txt檔讀進來是一整個字串
typeof(P)

## [1] "character"

分割字詞，並將兩個情緒字典併在一起

# 將字串依,分割
# strsplit回傳list , 我們取出list中的第一個元素
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]

# 建立dataframe 有兩個欄位word,sentiments，word欄位內容是字典向量
P = data.frame(word = P, sentiment = "positive") #664
N = data.frame(word = N, sentiment = "negative") #1047

# 把兩個字典拼在一起
LIWC = rbind(P, N)

# 檢視字典
head(LIWC)

##       word sentiment
## 1     一流  positive
## 2 下定決心  positive
## 3 不拘小節  positive
## 4   不費力  positive
## 5     不錯  positive
## 6     主動  positive

3. 將文章和與LIWC情緒字典做join

正負情緒發文折線圖

PTT 發文折線圖

MetaData$artDate= MetaData$artDate %>% as.Date("%Y/%m/%d")
MetaData %>%
  group_by(artDate) %>%
  summarise(count = n()) %>%
  ggplot()+
    geom_line(aes(x=artDate,y=count))+
    scale_x_date(labels = date_format("%m/%d"))

## `summarise()` ungrouping output (override with `.groups` argument)

找出文集中，對於LIWC字典是positive和negative的字

算出每天情緒總和(sentiment_count)

# sentiment_count:artDate,sentiment,count
sentiment_count = data_select %>%
  select(artDate, artUrl, word) %>%
  inner_join(LIWC) %>% 
  group_by(artDate,sentiment) %>%
  summarise(count=n())

## Joining, by = "word"

## `summarise()` regrouping output by 'artDate' (override with `.groups` argument)

畫出每天的情緒總分數，可以看出整題而言，PTT 上的貼文用詞使用的負面詞比正面詞的數量多，只有 03/22 的正面和負面詞彙出現的次數差不多。

正負情緒分數折線圖

# 檢視資料的日期區間
range(sentiment_count$artDate) #"2021-03-12" "2021-03-26"

## [1] "2021-03-11" "2021-03-25"

sentiment_count %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-03-12','2021-03-26'))
               )+
  # 加上標示日期的線
  geom_vline(aes(
    xintercept = as.numeric(
      artDate[which(sentiment_count$artDate == as.Date('2021-03-16'))[1]])),colour = "black") +
  geom_vline(aes(
    xintercept = as.numeric(
      artDate[which(sentiment_count$artDate == as.Date('2021-03-22'))[1]])),colour = "black")

## Warning: Removed 2 row(s) containing missing values (geom_path).

將情緒分數標準化後再畫一次圖，可以發現雖然正負面情緒有波動，但是多數文章的情緒還是以負面為主。

正負情緒比例折線圖

sentiment_count %>% 
  # 標準化的部分
  group_by(artDate) %>%
  mutate(ratio = count/sum(count)) %>%
  #mutate(ratio = count/length(unique(artUrl))) %>%
  # 畫圖的部分
  ggplot()+
  geom_line(aes(x=artDate,y=ratio,colour=sentiment))+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-03-12','2021-03-26'))
               )+
  # 加上標示日期的線
  geom_vline(aes(
    xintercept = as.numeric(
      artDate[which(sentiment_count$artDate == as.Date('2021-03-16')) [1]])),colour = "black")+
  geom_vline(aes(
    xintercept = as.numeric(
      artDate[which(sentiment_count$artDate == as.Date('2021-03-22')) [1]])),colour = "black")

## Warning: Removed 2 row(s) containing missing values (geom_path).

我們挑出幾個文章數量與情緒都處於高點的日期

# 查看每天的情緒分數排名
sentiment_count %>%
  select(count,artDate) %>%
  group_by(artDate) %>%
  summarise(sum = sum(count)) %>%
  arrange(desc(sum))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 15 x 2
##    artDate      sum
##    <date>     <int>
##  1 2021-03-22   876
##  2 2021-03-16   726
##  3 2021-03-24   621
##  4 2021-03-19   468
##  5 2021-03-14   370
##  6 2021-03-12   352
##  7 2021-03-25   346
##  8 2021-03-15   337
##  9 2021-03-18   328
## 10 2021-03-23   319
## 11 2021-03-20   303
## 12 2021-03-13   259
## 13 2021-03-17   199
## 14 2021-03-21   173
## 15 2021-03-11    42

4. 畫出文字雲

挑出有興趣的日期，畫出文字雲看看都在討論甚麼主題。

先從2021-03-16 的情緒高點看起，出現「巴拉圭」、「血栓」，推測討論的主題可能跟資助巴拉圭購買疫苗、AZ疫苗的血栓疑慮以及該不該繼續施打 AZ 疫苗有關。

2021-03-16 文字雲

# 畫出文字雲

word_count %>% 
  filter(!(word %in% c("疫苗"))) %>%
  filter(artDate == as.Date('2021-03-16')) %>% 
  select(word,count) %>% 
  group_by(word) %>% 
  summarise(count = sum(count)) %>%
  arrange(desc(count)) %>%
  filter(count>20) %>%   # 過濾出現太少次的字
  wordcloud2()

## Adding missing grouping variables: `artDate`

## `summarise()` ungrouping output (override with `.groups` argument)

3月22日的文字雲中，出現「直播」、「公開」、「監督」、「立委」等詞彙，推測討論的議題有：疫苗調閱小組、疫苗開打、行政院長打假針

2021-03-22 文字雲

# 畫出文字雲
word_count %>% 
  filter(!(word %in% c("疫苗"))) %>%
  filter(artDate == as.Date('2021-03-22')) %>% 
  select(word,count) %>% 
  group_by(word) %>% 
  summarise(count = sum(count)) %>%
  arrange(desc(count)) %>%
  filter(count>20) %>%   # 過濾出現太少次的字
  wordcloud2()

## Adding missing grouping variables: `artDate`

## `summarise()` ungrouping output (override with `.groups` argument)

## 5.找出情緒字典代表字

算出所有字詞的詞頻(sentiment_sum)，找出情緒代表字

正負情緒代表字

排除情緒性的謾罵字眼後，可以看到與疫苗相關的貼文中的負面用詞主要跟施打疫苗的安全問題有關。

# sentiment_sum:word,sentiment,sum
sentiment_sum <- 
  word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame()

## Joining, by = "word"

## `summarise()` regrouping output by 'word' (override with `.groups` argument)

sentiment_sum %>%
  top_n(30,wt = sum) %>%
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  theme(text=element_text(family = "Heiti TC Light",size=8))+
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()

正負情緒文字雲

par(family=("Heiti TC Light"))
sentiment_sum %>%
  acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
  comparison.cloud(
    colors = c("salmon", "#72bcd4"), # positive negative
                   max.words = 50)

另外，也可以依據不同日期觀察情緒代表字的變化

2021-03-16 正負情緒代表字

sentiment_sum_select <- 
word_count %>%
  filter(artDate == as.Date('2021-03-16')) %>% 
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame()

## Joining, by = "word"

## `summarise()` regrouping output by 'word' (override with `.groups` argument)

sentiment_sum_select   %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  theme(text=element_text(family = "Heiti TC Light",size=8))+
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment 0316",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()

2021-03-16 正負情緒文字雲

par(family=("Heiti TC Light"))
sentiment_sum_select %>%
  acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
  comparison.cloud(
    colors = c("salmon", "#72bcd4"), # positive negative
                   max.words = 50)

2021-03-22 正負情緒代表字

sentiment_sum_select <- 
word_count %>%
  filter(artDate == as.Date('2021-03-22')) %>% 
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame()

## Joining, by = "word"

## `summarise()` regrouping output by 'word' (override with `.groups` argument)

sentiment_sum_select   %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment 0322",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()

2021-03-22 正負情緒文字雲

par(family=("Heiti TC Light"))
sentiment_sum_select %>%
  acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
  comparison.cloud(
    colors = c("salmon", "#72bcd4"), # positive negative
                   max.words = 50)

6.歸類正負面文章

之前的情緒分析大部分是全部的詞彙加總，接下來將正負面情緒的文章分開，看看能不能發現一些新的東西。接下來歸類文章，將每一篇文章正負面情緒的分數算出來，然後大概分類文章屬於正面還是負面。

# 依據情緒值的正負比例歸類文章
article_type = 
  data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開，缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame()

## Joining, by = "word"

## `summarise()` regrouping output by 'artUrl' (override with `.groups` argument)

# 看一下正負比例的文章各有幾篇
article_type %>%
  group_by(type) %>%
  summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   type     count
##   <chr>    <int>
## 1 negative   175
## 2 positive   151

正負情緒文章數量統計圖

# 
article_type_date = left_join(article_type[,c("artUrl", "type")], MetaData[,c("artUrl", "artDate")], by = "artUrl")


article_type_date %>%
  group_by(artDate,type) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = artDate, y = count, fill = type)) + 
  geom_bar(stat = "identity", position = "dodge")+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-03-12','2021-03-25'))
               )

## `summarise()` regrouping output by 'artDate' (override with `.groups` argument)

## Warning: Removed 3 rows containing missing values (geom_bar).

把正面和負面的文章挑出來，並和斷詞結果合併。

# negative_article:artUrl,word
negative_article <-
article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
positive_article <-
article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(data_select[,c("artUrl", "word")], by = "artUrl")

畫出正負面文章情緒貢獻度較高的關鍵字

情緒關鍵字:負面情緒文章

# 負面情緒關鍵字貢獻圖
negative_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative sentiment",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()

## Joining, by = "word"

## `summarise()` regrouping output by 'word' (override with `.groups` argument)

情緒關鍵字:正面情緒文章

# 正面情緒關鍵字貢獻圖
positive_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to positive sentiment",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()

## Joining, by = "word"

## `summarise()` regrouping output by 'word' (override with `.groups` argument)

從正負面情緒圖觀察發現，正面和負面的關鍵字沒有甚麼顯著的差異，負面情緒較高的文章比較常出現「問題」、「風險」、「嚴重」等討論疫苗風險負面影響的字詞；正面情緒較高的文章出現較多的「相信」、「願意」、「支持」等字詞。

7.加入其他資料來源比較

我們加入東森新聞網、聯合新聞網以及蘋果新聞網的資料做比較希望能比較不同新聞媒體的報導以及PTT的貼文的情緒強度，中時新聞因為政治版以及國際版沒有與疫苗相關的報導，所以沒有採用

重新整理 PTT token，這裡只計算貼文的詞彙

PTT_Token <- rbind(MToken[,c("artDate","artUrl", "word")]) 

PTT_Token = PTT_Token %>% mutate(source = "ptt")

# 加入聯合新聞網資料作比較
udn  = fread('../group_week_5/udn_vaccine_articleMetaData.csv',encoding = 'UTF-8')
UToken <- udn %>% unnest_tokens(word, sentence, token=customized_tokenizer)

UDN_Token = UToken %>% mutate(source = "udn")

# 加入東森新聞網資料做比較
ebc  = fread('../group_week_5/ebc_vaccine_articleMetaData.csv',encoding = 'UTF-8')
EToken <- ebc %>% unnest_tokens(word, sentence, token=customized_tokenizer)

EBC_Token = EToken %>% mutate(source = "ebc")

# 加入蘋果新聞網資料做比較
apple  = fread('../group_week_5/apple_vaccine_articleMetaData.csv',encoding = 'UTF-8')
AToken <- apple %>% unnest_tokens(word, sentence, token=customized_tokenizer)

APPLE_Token = AToken %>% mutate(source = "apple")

# 把資料併在一起
data_combine = rbind(PTT_Token, 
                     UDN_Token[, c("artDate", "artUrl", "word", "source")], 
                     EBC_Token[, c("artDate", "artUrl", "word", "source")],
                     APPLE_Token[, c("artDate", "artUrl", "word", "source")])

data_combine$artDate= data_combine$artDate %>% as.Date("%Y/%m/%d")

ptt和dcard的情緒分布直方圖，可以發現dcard相較於ptt正面情緒稍多，話題討論高峰的時間點也大致相同。

多個媒體的文章的情緒分數比較

range(udn$artDate) #"2021/03/12" "2021/03/26"

## [1] "2021/03/12" "2021/03/25"

data_combine %>%
  inner_join(LIWC) %>%
  group_by(artDate,sentiment,source) %>%
  summarise(count = n()) %>%
  filter(artDate>='2021-03-12') %>%
  
  # 畫圖的部分
  ggplot(aes(x= artDate,y=count,fill=sentiment)) +
  scale_color_manual() +
  geom_col(position="dodge") + 
  scale_x_date(labels = date_format("%m/%d")) +
  labs(title = "Sentiment of PTT, UDN, EBC and Apple",color = "情緒類別") +
  facet_wrap(~source, ncol = 1, scales="free_y")  # scale可以調整比例尺

## Joining, by = "word"

## `summarise()` regrouping output by 'artDate', 'sentiment' (override with `.groups` argument)

總結

總結上述提到的三個問題：

疫苗相關的討論主要聚集於哪幾天，話題高峰與討論主題為何？ > 疫苗相關的討論高峰出現在 03/16 與 03/22 兩天，03/16 主要討論 AZ疫苗的安全問題和政府資助巴拉圭政府購買疫苗，03/22 主要討論當天疫苗開打和官員打假針的議題
正面與負面的討論內容為何？ > 正面的討論內容主要是相信疫苗的效果、願意施打疫苗，文字雲出現「相信」、「願意」、「支持」等詞彙。 > 負面的討論內容則是疫苗的安全問題，文字雲出現「問題」、「嚴重」、「風險」等詞彙。
PTT 貼文與媒體報導的情緒強度是否有差，我們假設 PTT 鄉民的會使用比較多的情緒字 > 比較文章數量教接近的 PTT 跟聯合新聞網會發現，除去03/22那天，PTT 的貼文使用的情緒字次數與聯合新聞網相差不多，考慮到新聞報導的長度通常會比 PTT 貼文長，我們推論 PTT 和聯合新聞網的情緒字使用頻率應該差不多。
而比較東森新聞與蘋果新聞的每日情緒字次數可以發現，蘋果新聞的情緒字使用頻率比東森新聞高。