Ch.0 : 動機分析目的與資料取得

1. 動機與分析目的

去年台灣創下56年來首度沒有颱風登陸的紀錄,這也讓水庫無水進帳,全台面臨缺水的危機,在水利署於2月25日發出的水情燈號中,已有許多縣市進入減量供水的燈色號燈。由於越來越嚴峻的水情,高雄市自4月17日起停止洗車場、公私立游泳池供水等,市府並配合水利署加速開鑿水井。在宿舍的浴室中也出現了應對停水的大水桶。

我們想探討:

  • 在PTT以及Dcard人們對於水情所討論的焦點都有哪些,情緒變化是怎樣的?
  • 各家媒體對於水情的報導有何不同?
  • 各個縣市對於缺水狀況關注的重點是否一樣?
  • 全球對於台灣水情關心的重點又在哪裏?

2. 資料取得及套件載入

資料基本介紹

  • 資料來源: 文字平台收集PTT Gossip版 + Dcard時事版
  • 資料集: PPTnowater_articleMetaData.csv
  • 關鍵字:缺水、水情、水庫、下雨
  • 2021/02/01 ~ 2021/04/25 共得到 886 篇文章。

簡單的資料比較

Dcard的資料量遠少於PTT。可以看出年輕人對於水情的關注度不高。

系統參數設定

[1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/en_US.UTF-8"

安裝需要的packages

讀進library

Ch.1 : 資料的基本分析

1. 資料前處理

  • 文章斷詞
  • 資料基本清理

(1). 文章斷詞

設定斷詞引擎

(2). 資料基本清理

  • 日期格式化
  • 去除特殊字元、詞頻太低的字
`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.

Ch.2 : 情緒分析以及文字雲

1. 將文章和與LIWC情緒字典做join

四月正負情緒分數折線圖

[1] "2021-02-05" "2021-04-28"

  • 可以看出:3月負面情緒佔據了主導地位,4月又幾天的正面情緒超過了負面情緒

  • 主要原因是4月開始逐漸降雨量增多,尤其是4.25前後高雄有持續幾天的陰天降雨

2. 畫出文字雲

2021-04-25 文字雲

Adding missing grouping variables: `artDate`

因為高雄市長陳其邁昨在臉書PO文分享穿西裝淋雨罩,引發熱議。

2021-04-12 文字雲

水情嚴峻 黃偉哲:台南水庫用水可撐到7月底

  • 清淤:水庫存量下降、水情吃緊,卻也意外迎來清淤的好時機,因此水利署加大清淤力道,去年的清淤量達1440萬立方公尺,創下歷史新高紀錄。

  • 超前部署:超前部署,這3年多來對於區域供水,比如把翡翠水庫的水引到新北、石門水庫引到新竹,不然會更加嚴重。

3.找出情緒字典代表字

算出所有字詞的詞頻(sentiment_sum),找出情緒代表字

4.歸類正負面文章

之前的情緒分析大部分是全部的詞彙加總,接下來將正負面情緒的文章分開,看看能不能發現一些新的東西。接下來歸類文章,將每一篇文章正負面情緒的分數算出來,然後大概分類文章屬於正面還是負面。

Joining, by = "word"
`summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.

Ch.3: 各家新聞媒體的情緒分析

  • ETTV:東森
  • UDN:聯合
  • Apple:蘋果

1. 資料前處理

`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.

3.將三個新聞網的資料合併做比較

Joining, by = "word"
`summarise()` has grouped output by 'artDate', 'sentiment'. You can override using the `.groups` argument.

可以看出:

  • 只有聯合新聞是有連續每天的在跟進水情的新聞
  • 聯合新聞的情緒詞出現的較多
  • 後期的報導都以正面情緒居多
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.

5.歸類正負面文章

Joining, by = "word"
`summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.
Joining, by = "word"
`summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.
Joining, by = "word"
`summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.

Ch.4: 各個縣市對於水情關注的重點

1.ngram

前後五個字彙

3.出現在各個縣市周圍的詞彙

台南、台中、台北、宜蘭、中南部

  • 台南、高雄:高雄支援台南用水2/25之後轉黃燈 所以暫停支援

  • 台中:台中、苗栗、新竹水情嚴種 4/6起停5供2

  • 宜蘭:台積電為解決缺水,傳出設廠宜蘭的消息

4.其他一些較常出現的詞彙

台積電、民生、農業

  • 台積電: 台灣易缺水缺電擬移設備至南京廠

  • 農業: 農業和半導體為兩大主要用水,缺水問題影響兩大產業

Ch.5: 主題模型的分析

建立LDA模型

統計每篇文章詞頻

The `...` argument of `group_keys()` is deprecated as of dplyr 1.0.0.
Please `group_by()` first
<<DocumentTermMatrix (documents: 741, terms: 3173)>>
Non-/sparse entries: 63859/2287334
Sparsity           : 97%
Maximal term length: 5
Weighting          : term frequency (tf)

兩主題之間相差最大的詞彙

正越大表示越傾向主題二,負越大越傾向主題一,

LDAvis

只分為兩個主題出來的結果並不是很明確,這裡改成分為三個主題。

To stop the server, run servr::daemon_stop(1) or restart your R session
Serving the directory /private/var/folders/ww/lpf7_83x4hb2q8pwlyrcj0vm0000gn/T/RtmpAKESbp/file4cbb29a857bb at http://127.0.0.1:4321

Ch.6:其他

我們分析了Twitter上有關 #Taiwan 和 #drought 作為關鍵字的貼文

發現大家關注的焦點主要是在,晶片和半導體

Ch.7:結論

  • 大家的討論主要還是負面情緒居多,負面情緒的來源主要是希望政府可以對於缺水的情況有更多的作為,以及對於未來水情的擔憂。
  • 各個縣市對於缺水狀況關注的重點,各有不同。
  • 全球對於台灣缺水狀況的反映,主要是對於全球晶片產能的擔憂。
---
title: "社群媒體期中報告 - 最近兩個月水情的討論分析"
author: "張惠茹、王弘銘、陳宥任、葉思卿"
date: "2021/4/30"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    highlight: pygments
    theme: flatly
    css: style.css
  html_document:
    toc: yes
    df_print: paged
---

# Ch.0 : 動機分析目的與資料取得

## 1. 動機與分析目的

去年台灣創下56年來首度沒有颱風登陸的紀錄，這也讓水庫無水進帳，全台面臨缺水的危機，在水利署於2月25日發出的水情燈號中，已有許多縣市進入減量供水的燈色號燈。由於越來越嚴峻的水情，高雄市自4月17日起停止洗車場、公私立游泳池供水等，市府並配合水利署加速開鑿水井。在宿舍的浴室中也出現了應對停水的大水桶。

我們想探討：

+ 在PTT以及Dcard人們對於水情所討論的焦點都有哪些，情緒變化是怎樣的？
+ 各家媒體對於水情的報導有何不同？
+ 各個縣市對於缺水狀況關注的重點是否一樣？
+ 全球對於台灣水情關心的重點又在哪裏？

## 2. 資料取得及套件載入

### 資料基本介紹

+ 資料來源: 文字平台收集PTT Gossip版 + Dcard時事版
+ 資料集： PPTnowater_articleMetaData.csv
+ 關鍵字：缺水、水情、水庫、下雨
+ 2021/02/01 ~ 2021/04/25 共得到 886 篇文章。

簡單的資料比較

![](compare.png)

Dcard的資料量遠少於PTT。可以看出年輕人對於水情的關注度不高。

系統參數設定
```{r,warning=FALSE,message=FALSE}
Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # 避免中文亂碼
```

安裝需要的packages
```{r warning=FALSE}
# echo = T,results = 'hide'
packages = c("dplyr", "tidytext", "stringr", "wordcloud2", "ggplot2",'readr','data.table','reshape2','wordcloud','tidyr','scales','ngram')
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
```

讀進library
```{r,warning=FALSE,message=FALSE}
library(dplyr)
library(stringr)
library(tidytext)
library(wordcloud2)
library(data.table)
library(ggplot2)
library(reshape2)
library(wordcloud)
library(tidyr)
library(readr)
library(scales)
library(jiebaR)
library(ngram)

library(gutenbergr)
library(widyr)
library(NLP)
library(ggraph)
library(igraph)
library(tm)
library(slam)
library(Rtsne)
library(randomcoloR)
library(topicmodels)
library(LDAvis)
library(webshot)
library(htmlwidgets)
library(servr)
```
```{r}
setwd("/Users/a1234/Downloads/project")
```

```{r}
# 把文章和留言讀進來
MetaData = read.csv('PPTnowater_articleMetaData.csv',encoding = 'UTF-8')
Reviews  = read.csv('PTTnowater_articleReviews.csv',encoding = 'UTF-8')

MetaData$sentence <- as.character(MetaData$sentence)
Reviews$cmtContent <- as.character(Reviews$cmtContent)

### 移除PTT貼新聞時會出現的格式用字
MetaData <- MetaData %>% 
  mutate(sentence=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除", "", sentence))

# 挑選文章對應的留言
Reviews = left_join(MetaData, Reviews[,c("artUrl", "cmtContent")], by = "artUrl")
```

# Ch.1 : 資料的基本分析

## 1. 資料前處理

+ 文章斷詞
+ 資料基本清理

(1). 文章斷詞

設定斷詞引擎
```{r}
# 加入自定義的字典
jieba_tokenizer <- worker(user="dict/user_dict.txt", stop_word = "dict/stop_words.txt")

# 設定斷詞function
customized_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    tokens <- tokens[nchar(tokens)>1]
    return(tokens)
  })
}
```

```{r}
# 把文章和留言的斷詞結果併在一起
MToken <- MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
RToken <- Reviews %>% unnest_tokens(word, cmtContent, token=customized_tokenizer)

# 把資料併在一起
data <- rbind(MToken[,c("artDate","artUrl", "word")],RToken[,c("artDate","artUrl","word")]) 
```

(2). 資料基本清理

+ 日期格式化
+ 去除特殊字元、詞頻太低的字

```{r}
# 格式化日期欄位
data$artDate= data$artDate %>% as.Date("%Y/%m/%d")

# 過濾特殊字元
data_select = data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1) 
  
# 算每天不同字的詞頻
# word_count:artDate,word,count
word_count <- data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))
head(word_count)
```


## 2. 準備LIWC字典

> 全名Linguistic Inquiry and Word Counts，由心理學家Pennebaker於2001出版
> 分為正向情緒與負向情緒


```{r}
#讀檔，字詞間以","將字分隔
P <- read_file("dict/liwc/positive.txt") # 正向字典txt檔
N <- read_file("dict/liwc/negative.txt") # 負向字典txt檔

# 將字串依,分割
# strsplit回傳list , 我們取出list中的第一個元素
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]

# 建立dataframe 有兩個欄位word,sentiments，word欄位內容是字典向量
P = data.frame(word = P, sentiment = "positive") #664
N = data.frame(word = N, sentiment = "negative") #1047

# 把兩個字典拼在一起
LIWC = rbind(P, N)
```

# Ch.2 : 情緒分析以及文字雲

## 1. 將文章和與LIWC情緒字典做join

### 發文折線圖
```{r}
MetaData$artDate= MetaData$artDate %>% as.Date("%Y/%m/%d")
MetaData %>%
  group_by(artDate) %>%
  summarise(count = n()) %>%
  ggplot()+
    geom_line(aes(x=artDate,y=count))+
    scale_x_date(labels = date_format("%m/%d"))+
  geom_vline(aes(xintercept = as.numeric(artDate[which(artDate == as.Date('2021-04-12'))
[1]])),colour = "red") 
```

> 找出文集中，對於LIWC字典是positive和negative的字

算出每天情緒總和(sentiment_count)
```{r}
# sentiment_count:artDate,sentiment,count
sentiment_count = data_select %>%
  select(artDate,word) %>%
  inner_join(LIWC) %>% 
  group_by(artDate,sentiment) %>%
  summarise(count=n())  
```


### 三月正負情緒分數折線圖
```{r}
# 檢視資料的日期區間
range(sentiment_count$artDate)
sentiment_count %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-03-01','2021-03-31'))
               )
```

### 四月正負情緒分數折線圖
```{r}
# 檢視資料的日期區間
range(sentiment_count$artDate)
sentiment_count %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-04-01','2021-04-30'))
               )+
  # 加上標示日期的線
  geom_vline(aes(xintercept = as.numeric(artDate[which(sentiment_count$artDate == as.Date('2021-04-25'))
[1]])),colour = "black")+
  geom_vline(aes(xintercept = as.numeric(artDate[which(sentiment_count$artDate == as.Date('2021-04-03'))
[1]])),colour = "black") 
```

+ 可以看出：3月負面情緒佔據了主導地位，4月又幾天的正面情緒超過了負面情緒

+ 主要原因是4月開始逐漸降雨量增多，尤其是4.25前後高雄有持續幾天的陰天降雨

```{r}
# 查看每天的情緒分數排名
show_top3 <- sentiment_count %>%
  select(count,artDate) %>%
  group_by(artDate) %>%
  summarise(sum = sum(count)) %>%
  arrange(desc(sum))

head(show_top3,3)
```

## 2. 畫出文字雲

### 2021-04-25 文字雲
```{r}
# 畫出文字雲
word_count %>%
  filter(!(word %in% c("缺水","水庫","下雨","台灣"))) %>%
  filter(artDate == as.Date('2021-04-25')) %>% 
  select(word,count) %>% 
  group_by(word) %>% 
  summarise(count = sum(count)) %>%
  arrange(desc(count)) %>%
  filter(count>30) %>%   # 過濾出現太少次的字
  wordcloud2()
```

因為高雄市長陳其邁昨在臉書PO文分享穿西裝淋雨罩，引發熱議。

### 2021-03-05 文字雲
```{r}
# 畫出文字雲
# plot_0305=word_count %>%
#   filter(!(word %in% c("缺水","水庫","下雨","台灣"))) %>%
#   filter(artDate == as.Date('2021-03-05')) %>% 
#   select(word,count) %>% 
#   group_by(word) %>% 
#   summarise(count = sum(count)) %>%
#   arrange(desc(count)) %>%
#   filter(count>20) %>%   # 過濾出現太少次的字
#   wordcloud2()
# plot_0305
```
![](305.png)

### 2021-04-12 文字雲
```{r,warning=FALSE,message=FALSE}
# 畫出文字雲
# plot_0412 = word_count %>% 
#   filter(!(word %in% c("缺水","水庫","下雨","台灣"))) %>%
#   filter(artDate == as.Date('2021-04-12')) %>% 
#   select(word,count) %>% 
#   group_by(word) %>% 
#   summarise(count = sum(count)) %>%
#   arrange(desc(count)) %>%
#   filter(count>20) %>%   # 過濾出現太少次的字
#   wordcloud2()
# plot_0412
```
![](412.png)

水情嚴峻 黃偉哲：台南水庫用水可撐到7月底
		
```{r}
data_tokens_date <- data_select %>% 
  filter(!(word %in% c("缺水","水庫","下雨","台灣"))) %>% 
  count(artDate, word, sort = TRUE)
data_tokens_date
data_tokens_date$artDate <- data_tokens_date$artDate %>% as.Date("%Y/%m/%d")

plot_merge <- data_tokens_date %>% 
  filter(artDate =="2021-04-12"| 
       artDate == "2021-03-05"| 
       artDate == "2021-04-25" |
        artDate == "2021-03-27")%>%
  group_by(artDate) %>%
  top_n(10,n)%>%
  mutate(word = reorder(word, n))%>%
  ggplot(aes(x= word, y=n)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = NULL) +
  facet_wrap(~artDate, scales="free", ncol = 2) +
  coord_flip()+
  theme(text = element_text(family = "Heiti TC Light"))
plot_merge
```

+ 清淤：水庫存量下降、水情吃緊，卻也意外迎來清淤的好時機，因此水利署加大清淤力道，去年的清淤量達1440萬立方公尺，創下歷史新高紀錄。

+ 超前部署：超前部署，這3年多來對於區域供水，比如把翡翠水庫的水引到新北、石門水庫引到新竹，不然會更加嚴重。


## 3.找出情緒字典代表字

算出所有字詞的詞頻(sentiment_sum)，找出情緒代表字

### 正負情緒代表字
```{r}
# sentiment_sum:word,sentiment,sum
sentiment_sum <- 
  word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 
  
sentiment_sum %>%
  top_n(30,wt = sum) %>%
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

另外一種呈現方式

### 正負情緒文字雲
```{r}
# sentiment_sum %>%
#   acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
#   comparison.cloud(
#     colors = c("salmon", "#72bcd4"), # positive negative
#                    max.words = 50,family = "Heiti TC Light")
```
![](正負情緒文字雲.png)

另外，也可以依據不同日期觀察情緒代表字的變化

### 2021-04-12 正負情緒代表字
```{r}
sentiment_sum_select <- 
word_count %>%
  filter(artDate == as.Date('2021-04-12')) %>% 
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 

sentiment_sum_select   %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment 0412",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```


### 2021-04-12 正負情緒文字雲
```{r}
# sentiment_sum_select %>%
#   acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
#   comparison.cloud(
#     colors = c("salmon", "#72bcd4"), # positive negative
#                    max.words = 50,family = "Heiti TC Light")
```

![](2021-04-12正負情緒文字雲.png)


## 4.歸類正負面文章

之前的情緒分析大部分是全部的詞彙加總，接下來將正負面情緒的文章分開，看看能不能發現一些新的東西。接下來歸類文章，將每一篇文章正負面情緒的分數算出來，然後大概分類文章屬於正面還是負面。

```{r}
# 依據情緒值的正負比例歸類文章
article_type = 
  data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開，缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame() 
  
# 看一下正負比例的文章各有幾篇
article_type %>%
  group_by(type) %>%
  summarise(count = n())
```


### 正負情緒文章數量統計圖

已缺水事件來說,負面情緒的文章比較多

```{r}
# 
article_type_date = left_join(article_type[,c("artUrl", "type")], MetaData[,c("artUrl", "artDate")], by = "artUrl")


article_type_date %>%
  group_by(artDate,type) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = artDate, y = count, fill = type)) + 
  geom_bar(stat = "identity", position = "dodge")+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-03-01','2021-04-30'))
               )
```

把正面和負面的文章挑出來，並和斷詞結果合併。

```{r}
# negative_article:artUrl,word
negative_article <-
article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
positive_article <-
article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(data_select[,c("artUrl", "word")], by = "artUrl")
```


畫出正負面文章情緒貢獻度較高的關鍵字

### 情緒關鍵字:負面情緒文章
```{r}
# 負面情緒關鍵字貢獻圖
negative_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

### 情緒關鍵字:正面情緒文章
```{r}
# 正面情緒關鍵字貢獻圖
positive_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to positive sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

# Ch.3: 各家新聞媒體的情緒分析

+ ETTV：東森
+ UDN：聯合
+ Apple：蘋果

## 1. 資料前處理
```{r}
# 把文章讀進來
ETTV_MetaData = fread('ETTV_articleMetaData.csv',encoding = 'UTF-8')
UDN_MetaData = fread('UDN_articleMetaData.csv',encoding = 'UTF-8')
Apple_MetaData = fread('Apple_articleMetaData.csv',encoding = 'UTF-8')
```

```{r}
# 斷詞結果
ETTV_data <- ETTV_MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
UDN_data <- UDN_MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
Apple_data <- Apple_MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
```

```{r}
# 格式化日期欄位
ETTV_data$artDate= ETTV_data$artDate %>% as.Date("%Y/%m/%d")
UDN_data$artDate= UDN_data$artDate %>% as.Date("%Y/%m/%d")
Apple_data$artDate= Apple_data$artDate %>% as.Date("%Y/%m/%d")
# 過濾特殊字元
ETTV_data_select = ETTV_data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1) 
UDN_data_select = UDN_data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1)
Apple_data_select = Apple_data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1)

ETTV_word_count <- ETTV_data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))

UDN_word_count <- UDN_data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))

Apple_word_count <- Apple_data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))
```

## 2.將文章和與LIWC情緒字典做join

在畫出情緒之前，先看看每天的發文情形。

### 各家新聞發文折線圖
```{r}
ETTV_MetaData$artDate= ETTV_MetaData$artDate %>% as.Date("%Y/%m/%d")
UDN_MetaData$artDate= UDN_MetaData$artDate %>% as.Date("%Y/%m/%d")
Apple_MetaData$artDate= Apple_MetaData$artDate %>% as.Date("%Y/%m/%d")

ETTV_Post <- ETTV_MetaData %>% group_by(artDate) %>% summarise(count = n())
UDN_Post <- UDN_MetaData %>% group_by(artDate) %>% summarise(count = n())
Apple_Post <- Apple_MetaData %>% group_by(artDate) %>% summarise(count = n())


ggplot()+
  geom_line(data = ETTV_Post, aes(x=artDate,y=count,colour = "聯合新聞網"))+
  scale_x_date(labels = date_format("%m/%d"))+
  geom_line(data = UDN_Post, aes(x=artDate,y=count,colour ="東森新聞網"))+
  scale_x_date(labels = date_format("%m/%d"))+
  geom_line(data = Apple_Post, aes(x=artDate,y=count,colour ="蘋果新聞網"))+
  scale_x_date(labels = date_format("%m/%d"))+
  scale_colour_manual("",values = c("聯合新聞網" = "red","東森新聞網" = "blue", "蘋果新聞網" = "black"))+
  theme(text=element_text(size=14,family = "Heiti TC Light"))
```

## 3.將三個新聞網的資料合併做比較

```{r}
ETTV_data <- ETTV_data %>% mutate(source = 'ETTV')
UDN_data <- UDN_data %>% mutate(source = 'UDN')
Apple_data <- Apple_data %>% mutate(source = 'Apple')

data_combine = rbind(ETTV_data,UDN_data, Apple_data)
data_combine$artDate= data_combine$artDate %>% as.Date("%Y/%m/%d")
```

```{r}
data_combine %>%
  inner_join(LIWC) %>%
  group_by(artDate,sentiment,source) %>%
  summarise(count = n()) %>%
  filter(artDate>='2021-03-01') %>%
  
  # 畫圖的部分
  ggplot(aes(x= artDate,y=count,fill=sentiment)) +
  scale_color_manual() +
  geom_col(position="dodge") + 
  scale_x_date(labels = date_format("%m/%d")) +
  labs(title = "sentiment of ptt & dcard",color = "情緒類別") +
  facet_wrap(~source, ncol = 1, scales="free_y")  # scale可以調整比例尺
```


可以看出：

+ 只有聯合新聞是有連續每天的在跟進水情的新聞
+ 聯合新聞的情緒詞出現的較多
+ 後期的報導都以正面情緒居多

```{r}
ETTV_sentiment_sum <- 
  ETTV_word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 

UDN_sentiment_sum <- 
  UDN_word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame()

Apple_sentiment_sum <- 
  Apple_word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 
```
## 4.算出所有字詞詞頻後，各新聞網最常出現的情緒代表字

```{r}
ETTV_sentiment_sum %>%
  top_n(30,wt = sum) %>%
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "ETTV Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

```{r}
UDN_sentiment_sum %>%
  top_n(30,wt = sum) %>%
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "UDN Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

```{r}
Apple_sentiment_sum %>%
  top_n(30,wt = sum) %>%
  mutate(word = reorder(word, sum)) %>%
  filter(!(word %in% c("作品"))) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Apple Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

可以看出東森新聞的報導情緒詞較少，較為中立客觀。

而聯合新聞中的情緒詞則非常豐富，同時出現了很多不常在新聞報導中看到的詞彙像是“逗趣”。

## 5.歸類正負面文章

```{r}
ETTV_article_type = 
  ETTV_data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開，缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame()

UDN_article_type = 
  UDN_data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開，缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame() 

Apple_article_type = 
  Apple_data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開，缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame() 
```

```{r}
# negative_article:artUrl,word
ETTV_negative_article <-
ETTV_article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(ETTV_data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
ETTV_positive_article <-
ETTV_article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(ETTV_data_select[,c("artUrl", "word")], by = "artUrl")

# negative_article:artUrl,word
UDN_negative_article <-
UDN_article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(UDN_data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
UDN_positive_article <-
UDN_article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(UDN_data_select[,c("artUrl", "word")], by = "artUrl")

# negative_article:artUrl,word
Apple_negative_article <-
Apple_article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(Apple_data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
Apple_positive_article <-
Apple_article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(Apple_data_select[,c("artUrl", "word")], by = "artUrl")
```

### 聯合新聞的正負面文章

因為聯合新聞的正負詞彙最為豐富，所以我們特別來看一下聯合新聞的正負面文章中主要出現的情緒詞

```{r}
# 負面情緒關鍵字貢獻圖
UDN_negative_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "UDN Contribution to negative sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```
```{r}
# 正面情緒關鍵字貢獻圖
UDN_positive_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(20,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "UDN Contribution to positive sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

# Ch.4: 各個縣市對於水情關注的重點

## 1.ngram 
前後五個字彙
```{r}
ngram_11 <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    ngram <- ngrams(tokens, 11)
    ngram <- lapply(ngram, paste, collapse = " ")
    unlist(ngram)
  })
}
```

```{r}
water_ngram_11 <- MetaData %>%
  select(artUrl, sentence) %>%
  unnest_tokens(ngram, sentence, token = ngram_11) %>%
  filter(!str_detect(ngram, regex("[0-9a-zA-Z]")))
water_ngrams_11_separated <- water_ngram_11 %>%
  separate(ngram, paste0("word", c(1:11),sep=""), sep = " ")
head(water_ngrams_11_separated)
```

## 2.查看出常出現在「缺水」附近的字。
```{r}
# 查看缺水附近的詞彙
water_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "缺水")
#water_check_words
```
```{r}
water_check_words_count <- water_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word=value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","嚴重","解決"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>%
  count(word, sort = TRUE)

water_check_words_count %>%
  top_n(10,n) %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("出現在「缺水」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light"))

# plot_merge <- data_tokens_date %>% 
#   filter(artDate == "2021-03-26"| 
#        artDate == "2021-03-22" | 
#        artDate == "2021-04-15" |
#        artDate == "2021-03-05") %>% 
#   group_by(artDate) %>% 
#   top_n(10,n) %>% 
#   ungroup() %>%
#   ggplot(aes(x= reorder(word, +n), y=n, fill = artDate)) +
#   geom_col(show.legend = FALSE) +
#   labs(x = NULL, y = NULL) +
#   facet_wrap(~artDate, scales="free", ncol = 2) +
#   coord_flip()+
#   theme(text = element_text(family = "Heiti TC Light"))

```

缺水周圍出現最多的是：缺電、各個地區、乾旱、危機

## 3.出現在各個縣市周圍的詞彙

台南、台中、台北、宜蘭、中南部
```{r}
# 查看缺水附近的詞彙
TN_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "台南")
head(TN_check_words)

TN_check_words_plot <- TN_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","台南"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>% 
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#999999") +
  xlab("「台南」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))
#TN_check_words_plot
```
```{r}
TC_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "台中") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","台中"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE,fill="#E69F00") +
  xlab("「台中」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))

#TC_check_words_plot
```
```{r}
TP_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "台北") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","台北"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#0072B2") +
  xlab("「台北」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))

#TP_check_words_plot
```
```{r}
YL_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "宜蘭") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","宜蘭"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(5,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE,fill="#D55E00") +
  xlab("「宜蘭」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))

#YL_check_words_plot
```
```{r}
# 中南部
CN_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "中南部") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","中南部"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(7,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE,fill="#009E73") +
  xlab("「中南部」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))

```
```{r}
# 高雄
KH_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "高雄") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","高雄"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE,fill="#CC79A7") +
  xlab("「高雄」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))


```

```{r}
# 合併多圖的function
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
 
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
 
  numPlots = length(plots)
 
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }
 
 if (numPlots==1) {
    print(plots[[1]])
 
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
 
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
 
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
```

```{r}
# The palette with grey:
# cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# 合併所有location的圖
multiplot(TN_check_words_plot, TC_check_words_plot, TP_check_words_plot,
          YL_check_words_plot, KH_check_words_plot, CN_check_words_plot, cols=2)
```

+ 台南、高雄:高雄支援台南用水2/25之後轉黃燈 所以暫停支援

+ 台中:台中、苗栗、新竹水情嚴種 4/6起停5供2

+ 宜蘭:台積電為解決缺水，傳出設廠宜蘭的消息

## 4.其他一些較常出現的詞彙

台積電、民生、農業
```{r}
# 查看「台積電」附近的詞彙
TSMC_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "台積電")
head(TSMC_check_words)

TSMC_check_words_plot <- TSMC_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  # filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","缺電"))) %>% 
  filter(!(word %in% c("台灣", "台積電"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>% 
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#999999") +
  xlab("「台積電」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))
TSMC_check_words_plot
```
```{r}
# 查看「民生」附近的詞彙
MS_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "民生")
head(MS_check_words)

MS_check_words_plot <- MS_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  # filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","缺電"))) %>% 
  filter(!(word %in% c("台灣"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>% 
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#999999") +
  xlab("「民生」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))
MS_check_words_plot
```
```{r}
# 查看「農業」附近的詞彙
AG_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "農業")
head(AG_check_words)

AG_check_words_plot <- AG_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  # filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","缺電"))) %>% 
  filter(!(word %in% c("台灣"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>% 
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#999999") +
  xlab("「農業」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))
AG_check_words_plot
```

+ 台積電: 台灣易缺水缺電擬移設備至南京廠

+ 農業: 農業和半導體為兩大主要用水，缺水問題影響兩大產業

# Ch.5: 主題模型的分析

## 建立LDA模型
統計每篇文章詞頻
```{r}
water_tokens <- rbind(MToken[,c("artDate", "word","artTitle")],RToken[,c("artDate","word","artTitle")]) 

# 這邊要去掉停用字，以及自建的辭典
water_artid <- water_tokens %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>% 
  count(artTitle, word) %>% 
  rename(count=n) %>% 
  mutate(artId = group_indices(., artTitle))
head(water_artid)
```


```{r}
reserved_word <- water_artid %>% 
  group_by(word) %>% 
  count() %>% 
  filter(n > 5) %>% 
  unlist()

water_artid <- water_artid %>% 
  filter(word %in% reserved_word)
```

```{r}
# 轉換為DTM
water_com_dtm <- water_artid %>% 
  cast_dtm(artId, word, count)
water_com_dtm
```
```{r}
library(LDAvis)
library(topicmodels)
# 轉為分成兩群的LDA
water_lda <- LDA(water_com_dtm, k = 2, control = list(seed = 1234))

two_topics <- tidy(water_lda, matrix = "beta")
head(two_topics)
```
```{r}
# 看分出來的兩個topic中，最常出現的詞
top_terms <- two_topics %>%
  filter(!(term %in% c("台灣"))) %>% 
  filter(!(term %in% stop_words), nchar(term)>1) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +  # 畫圖的部分
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  theme(text = element_text(family = "Heiti TC Light"))

top_terms

```

兩主題之間相差最大的詞彙

正越大表示越傾向主題二，負越大越傾向主題一，

```{r}
beta_spread <- two_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .0004 | topic2 > .0004 ) %>%
  mutate(log_ratio = log2(topic2 / topic1))

topic_ratio <- rbind(beta_spread %>% 
                         top_n(10,wt = log_ratio), 
                       beta_spread %>% 
                         top_n(-10, log_ratio)) %>%
  arrange(log_ratio)

topic_ratio %>% 
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
  geom_bar(stat="identity") + 
  xlab("Word")+
  coord_flip() +
  theme(text = element_text(family = "Heiti TC Light"))
```

LDAvis

只分為兩個主題出來的結果並不是很明確，這裡改成分為三個主題。
```{r}
topicmodels_json_ldavis <- function(fitted, doc_term){
    require(LDAvis)
    require(slam)
    phi <- as.matrix(posterior(fitted)$terms)
    theta <- as.matrix(posterior(fitted)$topics)
    vocab <- colnames(phi)
    term_freq <- slam::col_sums(doc_term)
    json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                            vocab = vocab,
                            doc.length = as.vector(table(doc_term$i)),
                            term.frequency = term_freq)
    return(json_lda)
}
```

```{r}
# water_ldavis <- LDA(water_com_dtm, k = 3, control = list(seed = 1234))
# json_res <- topicmodels_json_ldavis(water_ldavis,water_com_dtm)
# serVis(json_res, open.browser = T)
```
![](topic1.png)
![](topic2.png)
![](topic3.png)


# Ch.6：其他

我們分析了Twitter上有關 #Taiwan 和 #drought 作為關鍵字的貼文

![](twitter.png)


發現大家關注的焦點主要是在，晶片和半導體

# Ch.7：結論

+ 大家的討論主要還是負面情緒居多，負面情緒的來源主要是希望政府可以對於缺水的情況有更多的作為，以及對於未來水情的擔憂。
+ 各個縣市對於缺水狀況關注的重點，各有不同。
+ 全球對於台灣缺水狀況的反映，主要是對於全球晶片產能的擔憂。
