Ch.0 : 動機分析目的與資料取得
1. 動機與分析目的
去年台灣創下56年來首度沒有颱風登陸的紀錄,這也讓水庫無水進帳,全台面臨缺水的危機,在水利署於2月25日發出的水情燈號中,已有許多縣市進入減量供水的燈色號燈。由於越來越嚴峻的水情,高雄市自4月17日起停止洗車場、公私立游泳池供水等,市府並配合水利署加速開鑿水井。在宿舍的浴室中也出現了應對停水的大水桶。
我們想探討:
- 在PTT以及Dcard人們對於水情所討論的焦點都有哪些,情緒變化是怎樣的?
- 各家媒體對於水情的報導有何不同?
- 各個縣市對於缺水狀況關注的重點是否一樣?
- 全球對於台灣水情關心的重點又在哪裏?
2. 資料取得及套件載入
資料基本介紹
- 資料來源: 文字平台收集PTT Gossip版 + Dcard時事版
- 資料集: PPTnowater_articleMetaData.csv
- 關鍵字:缺水、水情、水庫、下雨
- 2021/02/01 ~ 2021/04/25 共得到 886 篇文章。
簡單的資料比較

Dcard的資料量遠少於PTT。可以看出年輕人對於水情的關注度不高。
系統參數設定
[1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/en_US.UTF-8"
安裝需要的packages
# echo = T,results = 'hide'
packages = c("dplyr", "tidytext", "stringr", "wordcloud2", "ggplot2",'readr','data.table','reshape2','wordcloud','tidyr','scales','ngram')
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
讀進library
Ch.1 : 資料的基本分析
1. 資料前處理
(1). 文章斷詞
設定斷詞引擎
# 把文章和留言的斷詞結果併在一起
MToken <- MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
RToken <- Reviews %>% unnest_tokens(word, cmtContent, token=customized_tokenizer)
# 把資料併在一起
data <- rbind(MToken[,c("artDate","artUrl", "word")],RToken[,c("artDate","artUrl","word")])
head(data,20)
(2). 資料基本清理
`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
2. 準備LIWC字典
全名Linguistic Inquiry and Word Counts,由心理學家Pennebaker於2001出版 分為正向情緒與負向情緒
#讀檔,字詞間以","將字分隔
P <- read_file("dict/liwc/positive.txt") # 正向字典txt檔
N <- read_file("dict/liwc/negative.txt") # 負向字典txt檔
# 將字串依,分割
# strsplit回傳list , 我們取出list中的第一個元素
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]
# 建立dataframe 有兩個欄位word,sentiments,word欄位內容是字典向量
P = data.frame(word = P, sentiment = "positive") #664
N = data.frame(word = N, sentiment = "negative") #1047
# 把兩個字典拼在一起
LIWC = rbind(P, N)
Ch.2 : 情緒分析以及文字雲
1. 將文章和與LIWC情緒字典做join
發文折線圖

找出文集中,對於LIWC字典是positive和negative的字
算出每天情緒總和(sentiment_count)
Joining, by = "word"
`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
三月正負情緒分數折線圖
[1] "2021-02-05" "2021-04-28"

四月正負情緒分數折線圖
[1] "2021-02-05" "2021-04-28"

2. 畫出文字雲
2021-04-25 文字雲
Adding missing grouping variables: `artDate`
因為高雄市長陳其邁昨在臉書PO文分享穿西裝淋雨罩,引發熱議。
2021-03-05 文字雲

2021-04-12 文字雲

水情嚴峻 黃偉哲:台南水庫用水可撐到7月底

3.找出情緒字典代表字
算出所有字詞的詞頻(sentiment_sum),找出情緒代表字
正負情緒代表字
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.

另外一種呈現方式
正負情緒文字雲

另外,也可以依據不同日期觀察情緒代表字的變化
2021-04-12 正負情緒代表字
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.

2021-04-12 正負情緒文字雲

4.歸類正負面文章
之前的情緒分析大部分是全部的詞彙加總,接下來將正負面情緒的文章分開,看看能不能發現一些新的東西。接下來歸類文章,將每一篇文章正負面情緒的分數算出來,然後大概分類文章屬於正面還是負面。
Joining, by = "word"
`summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.
正負情緒文章數量統計圖
已缺水事件來說,負面情緒的文章比較多
#
article_type_date = left_join(article_type[,c("artUrl", "type")], MetaData[,c("artUrl", "artDate")], by = "artUrl")
article_type_date %>%
group_by(artDate,type) %>%
summarise(count = n()) %>%
ggplot(aes(x = artDate, y = count, fill = type)) +
geom_bar(stat = "identity", position = "dodge")+
scale_x_date(labels = date_format("%m/%d"),
limits = as.Date(c('2021-03-01','2021-04-30'))
)
`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.

把正面和負面的文章挑出來,並和斷詞結果合併。
畫出正負面文章情緒貢獻度較高的關鍵字
情緒關鍵字:負面情緒文章
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.

情緒關鍵字:正面情緒文章
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.

Ch.3: 各家新聞媒體的情緒分析
1. 資料前處理
`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
2.將文章和與LIWC情緒字典做join
在畫出情緒之前,先看看每天的發文情形。
各家新聞發文折線圖
ETTV_MetaData$artDate= ETTV_MetaData$artDate %>% as.Date("%Y/%m/%d")
UDN_MetaData$artDate= UDN_MetaData$artDate %>% as.Date("%Y/%m/%d")
Apple_MetaData$artDate= Apple_MetaData$artDate %>% as.Date("%Y/%m/%d")
ETTV_Post <- ETTV_MetaData %>% group_by(artDate) %>% summarise(count = n())
UDN_Post <- UDN_MetaData %>% group_by(artDate) %>% summarise(count = n())
Apple_Post <- Apple_MetaData %>% group_by(artDate) %>% summarise(count = n())
ggplot()+
geom_line(data = ETTV_Post, aes(x=artDate,y=count,colour = "聯合新聞網"))+
scale_x_date(labels = date_format("%m/%d"))+
geom_line(data = UDN_Post, aes(x=artDate,y=count,colour ="東森新聞網"))+
scale_x_date(labels = date_format("%m/%d"))+
geom_line(data = Apple_Post, aes(x=artDate,y=count,colour ="蘋果新聞網"))+
scale_x_date(labels = date_format("%m/%d"))+
scale_colour_manual("",values = c("聯合新聞網" = "red","東森新聞網" = "blue", "蘋果新聞網" = "black"))+
theme(text=element_text(size=14,family = "Heiti TC Light"))
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

3.將三個新聞網的資料合併做比較
Joining, by = "word"
`summarise()` has grouped output by 'artDate', 'sentiment'. You can override using the `.groups` argument.

可以看出:
- 只有聯合新聞是有連續每天的在跟進水情的新聞
- 聯合新聞的情緒詞出現的較多
- 後期的報導都以正面情緒居多
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.
4.算出所有字詞詞頻後,各新聞網最常出現的情緒代表字



可以看出東森新聞的報導情緒詞較少,較為中立客觀。
而聯合新聞中的情緒詞則非常豐富,同時出現了很多不常在新聞報導中看到的詞彙像是“逗趣”。
5.歸類正負面文章
Joining, by = "word"
`summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.
Joining, by = "word"
`summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.
Joining, by = "word"
`summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.
# negative_article:artUrl,word
ETTV_negative_article <-
ETTV_article_type %>%
filter(type=="negative")%>%
select(artUrl) %>%
left_join(ETTV_data_select[,c("artUrl", "word")], by = "artUrl")
# positive_article:artUrl,word
ETTV_positive_article <-
ETTV_article_type %>%
filter(type=="positive")%>%
select(artUrl) %>%
left_join(ETTV_data_select[,c("artUrl", "word")], by = "artUrl")
# negative_article:artUrl,word
UDN_negative_article <-
UDN_article_type %>%
filter(type=="negative")%>%
select(artUrl) %>%
left_join(UDN_data_select[,c("artUrl", "word")], by = "artUrl")
# positive_article:artUrl,word
UDN_positive_article <-
UDN_article_type %>%
filter(type=="positive")%>%
select(artUrl) %>%
left_join(UDN_data_select[,c("artUrl", "word")], by = "artUrl")
# negative_article:artUrl,word
Apple_negative_article <-
Apple_article_type %>%
filter(type=="negative")%>%
select(artUrl) %>%
left_join(Apple_data_select[,c("artUrl", "word")], by = "artUrl")
# positive_article:artUrl,word
Apple_positive_article <-
Apple_article_type %>%
filter(type=="positive")%>%
select(artUrl) %>%
left_join(Apple_data_select[,c("artUrl", "word")], by = "artUrl")
聯合新聞的正負面文章
因為聯合新聞的正負詞彙最為豐富,所以我們特別來看一下聯合新聞的正負面文章中主要出現的情緒詞
Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.

Joining, by = "word"
`summarise()` has grouped output by 'word'. You can override using the `.groups` argument.

Ch.4: 各個縣市對於水情關注的重點
2.查看出常出現在「缺水」附近的字。
water_check_words_count <- water_check_words %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word=value) %>%
filter(variable!="word6") %>%
filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","嚴重","解決"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>%
count(word, sort = TRUE)
water_check_words_count %>%
top_n(10,n) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = n > 0)) +
geom_col(show.legend = FALSE) +
xlab("出現在「缺水」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light"))

缺水周圍出現最多的是:缺電、各個地區、乾旱、危機
3.出現在各個縣市周圍的詞彙
台南、台中、台北、宜蘭、中南部
TN_check_words_plot <- TN_check_words %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word = value) %>%
filter(variable!="word6") %>%
filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","台南"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>% ###
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(8,n) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE, fill="#999999") +
xlab("「台南」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light",size=10))
#TN_check_words_plot
TC_check_words_plot <- water_ngrams_11_separated %>%
filter(word6 == "台中") %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word = value) %>%
filter(variable!="word6") %>%
filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","台中"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>% ###
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(8,n) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE,fill="#E69F00") +
xlab("「台中」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light",size=10))
#TC_check_words_plot
TP_check_words_plot <- water_ngrams_11_separated %>%
filter(word6 == "台北") %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word = value) %>%
filter(variable!="word6") %>%
filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","台北"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>% ###
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(8,n) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE, fill="#0072B2") +
xlab("「台北」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light",size=10))
#TP_check_words_plot
YL_check_words_plot <- water_ngrams_11_separated %>%
filter(word6 == "宜蘭") %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word = value) %>%
filter(variable!="word6") %>%
filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","宜蘭"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>% ###
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(5,n) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE,fill="#D55E00") +
xlab("「宜蘭」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light",size=10))
#YL_check_words_plot
# 中南部
CN_check_words_plot <- water_ngrams_11_separated %>%
filter(word6 == "中南部") %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word = value) %>%
filter(variable!="word6") %>%
filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","中南部"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>% ###
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(7,n) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE,fill="#009E73") +
xlab("「中南部」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light",size=10))
# 高雄
KH_check_words_plot <- water_ngrams_11_separated %>%
filter(word6 == "高雄") %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word = value) %>%
filter(variable!="word6") %>%
filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","高雄"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>% ###
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(8,n) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE,fill="#CC79A7") +
xlab("「高雄」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light",size=10))
# 合併多圖的function
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
# The palette with grey:
# cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# 合併所有location的圖
multiplot(TN_check_words_plot, TC_check_words_plot, TP_check_words_plot,
YL_check_words_plot, KH_check_words_plot, CN_check_words_plot, cols=2)

4.其他一些較常出現的詞彙
台積電、民生、農業
TSMC_check_words_plot <- TSMC_check_words %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word = value) %>%
filter(variable!="word6") %>%
# filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","缺電"))) %>%
filter(!(word %in% c("台灣", "台積電"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>% ###
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(8,n) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE, fill="#999999") +
xlab("「台積電」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light",size=10))
TSMC_check_words_plot

MS_check_words_plot <- MS_check_words %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word = value) %>%
filter(variable!="word6") %>%
# filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","缺電"))) %>%
filter(!(word %in% c("台灣"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>% ###
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(8,n) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE, fill="#999999") +
xlab("「民生」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light",size=10))
MS_check_words_plot

AG_check_words_plot <- AG_check_words %>%
melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
rename(word = value) %>%
filter(variable!="word6") %>%
# filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","缺電"))) %>%
filter(!(word %in% c("台灣"))) %>%
filter(!(word %in% stop_words), nchar(word)>1) %>% ###
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
top_n(8,n) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE, fill="#999999") +
xlab("「農業」附近的字") +
ylab("出現次數") +
coord_flip()+
theme(text = element_text(family = "Heiti TC Light",size=10))
AG_check_words_plot

Ch.5: 主題模型的分析
建立LDA模型
統計每篇文章詞頻
water_tokens <- rbind(MToken[,c("artDate", "word","artTitle")],RToken[,c("artDate","word","artTitle")])
# 這邊要去掉停用字,以及自建的辭典
water_artid <- water_tokens %>%
filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>%
count(artTitle, word) %>%
rename(count=n) %>%
mutate(artId = group_indices(., artTitle))
The `...` argument of `group_keys()` is deprecated as of dplyr 1.0.0.
Please `group_by()` first
<<DocumentTermMatrix (documents: 741, terms: 3173)>>
Non-/sparse entries: 63859/2287334
Sparsity : 97%
Maximal term length: 5
Weighting : term frequency (tf)

兩主題之間相差最大的詞彙
正越大表示越傾向主題二,負越大越傾向主題一,

LDAvis
只分為兩個主題出來的結果並不是很明確,這裡改成分為三個主題。
To stop the server, run servr::daemon_stop(1) or restart your R session
Serving the directory /private/var/folders/ww/lpf7_83x4hb2q8pwlyrcj0vm0000gn/T/RtmpAKESbp/file4cbb29a857bb at http://127.0.0.1:4321

Ch.6:其他
我們分析了Twitter上有關 #Taiwan 和 #drought 作為關鍵字的貼文

發現大家關注的焦點主要是在,晶片和半導體
Ch.7:結論
- 大家的討論主要還是負面情緒居多,負面情緒的來源主要是希望政府可以對於缺水的情況有更多的作為,以及對於未來水情的擔憂。
- 各個縣市對於缺水狀況關注的重點,各有不同。
- 全球對於台灣缺水狀況的反映,主要是對於全球晶片產能的擔憂。
---
title: "社群媒體期中報告 - 最近兩個月水情的討論分析"
author: "張惠茹、王弘銘、陳宥任、葉思卿"
date: "2021/4/30"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    highlight: pygments
    theme: flatly
    css: style.css
  html_document:
    toc: yes
    df_print: paged
---

# Ch.0 : 動機分析目的與資料取得

## 1. 動機與分析目的

去年台灣創下56年來首度沒有颱風登陸的紀錄，這也讓水庫無水進帳，全台面臨缺水的危機，在水利署於2月25日發出的水情燈號中，已有許多縣市進入減量供水的燈色號燈。由於越來越嚴峻的水情，高雄市自4月17日起停止洗車場、公私立游泳池供水等，市府並配合水利署加速開鑿水井。在宿舍的浴室中也出現了應對停水的大水桶。

我們想探討：

+ 在PTT以及Dcard人們對於水情所討論的焦點都有哪些，情緒變化是怎樣的？
+ 各家媒體對於水情的報導有何不同？
+ 各個縣市對於缺水狀況關注的重點是否一樣？
+ 全球對於台灣水情關心的重點又在哪裏？

## 2. 資料取得及套件載入

### 資料基本介紹

+ 資料來源: 文字平台收集PTT Gossip版 + Dcard時事版
+ 資料集： PPTnowater_articleMetaData.csv
+ 關鍵字：缺水、水情、水庫、下雨
+ 2021/02/01 ~ 2021/04/25 共得到 886 篇文章。

簡單的資料比較

![](compare.png)

Dcard的資料量遠少於PTT。可以看出年輕人對於水情的關注度不高。

系統參數設定
```{r,warning=FALSE,message=FALSE}
Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # 避免中文亂碼
```

安裝需要的packages
```{r warning=FALSE}
# echo = T,results = 'hide'
packages = c("dplyr", "tidytext", "stringr", "wordcloud2", "ggplot2",'readr','data.table','reshape2','wordcloud','tidyr','scales','ngram')
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
```

讀進library
```{r,warning=FALSE,message=FALSE}
library(dplyr)
library(stringr)
library(tidytext)
library(wordcloud2)
library(data.table)
library(ggplot2)
library(reshape2)
library(wordcloud)
library(tidyr)
library(readr)
library(scales)
library(jiebaR)
library(ngram)

library(gutenbergr)
library(widyr)
library(NLP)
library(ggraph)
library(igraph)
library(tm)
library(slam)
library(Rtsne)
library(randomcoloR)
library(topicmodels)
library(LDAvis)
library(webshot)
library(htmlwidgets)
library(servr)
```
```{r}
setwd("/Users/a1234/Downloads/project")
```

```{r}
# 把文章和留言讀進來
MetaData = read.csv('PPTnowater_articleMetaData.csv',encoding = 'UTF-8')
Reviews  = read.csv('PTTnowater_articleReviews.csv',encoding = 'UTF-8')

MetaData$sentence <- as.character(MetaData$sentence)
Reviews$cmtContent <- as.character(Reviews$cmtContent)

### 移除PTT貼新聞時會出現的格式用字
MetaData <- MetaData %>% 
  mutate(sentence=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除", "", sentence))

# 挑選文章對應的留言
Reviews = left_join(MetaData, Reviews[,c("artUrl", "cmtContent")], by = "artUrl")
```

# Ch.1 : 資料的基本分析

## 1. 資料前處理

+ 文章斷詞
+ 資料基本清理

(1). 文章斷詞

設定斷詞引擎
```{r}
# 加入自定義的字典
jieba_tokenizer <- worker(user="dict/user_dict.txt", stop_word = "dict/stop_words.txt")

# 設定斷詞function
customized_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    tokens <- tokens[nchar(tokens)>1]
    return(tokens)
  })
}
```

```{r}
# 把文章和留言的斷詞結果併在一起
MToken <- MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
RToken <- Reviews %>% unnest_tokens(word, cmtContent, token=customized_tokenizer)

# 把資料併在一起
data <- rbind(MToken[,c("artDate","artUrl", "word")],RToken[,c("artDate","artUrl","word")]) 
```

(2). 資料基本清理

+ 日期格式化
+ 去除特殊字元、詞頻太低的字

```{r}
# 格式化日期欄位
data$artDate= data$artDate %>% as.Date("%Y/%m/%d")

# 過濾特殊字元
data_select = data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1) 
  
# 算每天不同字的詞頻
# word_count:artDate,word,count
word_count <- data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))
head(word_count)
```


## 2. 準備LIWC字典

> 全名Linguistic Inquiry and Word Counts，由心理學家Pennebaker於2001出版
> 分為正向情緒與負向情緒


```{r}
#讀檔，字詞間以","將字分隔
P <- read_file("dict/liwc/positive.txt") # 正向字典txt檔
N <- read_file("dict/liwc/negative.txt") # 負向字典txt檔

# 將字串依,分割
# strsplit回傳list , 我們取出list中的第一個元素
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]

# 建立dataframe 有兩個欄位word,sentiments，word欄位內容是字典向量
P = data.frame(word = P, sentiment = "positive") #664
N = data.frame(word = N, sentiment = "negative") #1047

# 把兩個字典拼在一起
LIWC = rbind(P, N)
```

# Ch.2 : 情緒分析以及文字雲

## 1. 將文章和與LIWC情緒字典做join

### 發文折線圖
```{r}
MetaData$artDate= MetaData$artDate %>% as.Date("%Y/%m/%d")
MetaData %>%
  group_by(artDate) %>%
  summarise(count = n()) %>%
  ggplot()+
    geom_line(aes(x=artDate,y=count))+
    scale_x_date(labels = date_format("%m/%d"))+
  geom_vline(aes(xintercept = as.numeric(artDate[which(artDate == as.Date('2021-04-12'))
[1]])),colour = "red") 
```

> 找出文集中，對於LIWC字典是positive和negative的字

算出每天情緒總和(sentiment_count)
```{r}
# sentiment_count:artDate,sentiment,count
sentiment_count = data_select %>%
  select(artDate,word) %>%
  inner_join(LIWC) %>% 
  group_by(artDate,sentiment) %>%
  summarise(count=n())  
```


### 三月正負情緒分數折線圖
```{r}
# 檢視資料的日期區間
range(sentiment_count$artDate)
sentiment_count %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-03-01','2021-03-31'))
               )
```

### 四月正負情緒分數折線圖
```{r}
# 檢視資料的日期區間
range(sentiment_count$artDate)
sentiment_count %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-04-01','2021-04-30'))
               )+
  # 加上標示日期的線
  geom_vline(aes(xintercept = as.numeric(artDate[which(sentiment_count$artDate == as.Date('2021-04-25'))
[1]])),colour = "black")+
  geom_vline(aes(xintercept = as.numeric(artDate[which(sentiment_count$artDate == as.Date('2021-04-03'))
[1]])),colour = "black") 
```

+ 可以看出：3月負面情緒佔據了主導地位，4月又幾天的正面情緒超過了負面情緒

+ 主要原因是4月開始逐漸降雨量增多，尤其是4.25前後高雄有持續幾天的陰天降雨

```{r}
# 查看每天的情緒分數排名
show_top3 <- sentiment_count %>%
  select(count,artDate) %>%
  group_by(artDate) %>%
  summarise(sum = sum(count)) %>%
  arrange(desc(sum))

head(show_top3,3)
```

## 2. 畫出文字雲

### 2021-04-25 文字雲
```{r}
# 畫出文字雲
word_count %>%
  filter(!(word %in% c("缺水","水庫","下雨","台灣"))) %>%
  filter(artDate == as.Date('2021-04-25')) %>% 
  select(word,count) %>% 
  group_by(word) %>% 
  summarise(count = sum(count)) %>%
  arrange(desc(count)) %>%
  filter(count>30) %>%   # 過濾出現太少次的字
  wordcloud2()
```

因為高雄市長陳其邁昨在臉書PO文分享穿西裝淋雨罩，引發熱議。

### 2021-03-05 文字雲
```{r}
# 畫出文字雲
# plot_0305=word_count %>%
#   filter(!(word %in% c("缺水","水庫","下雨","台灣"))) %>%
#   filter(artDate == as.Date('2021-03-05')) %>% 
#   select(word,count) %>% 
#   group_by(word) %>% 
#   summarise(count = sum(count)) %>%
#   arrange(desc(count)) %>%
#   filter(count>20) %>%   # 過濾出現太少次的字
#   wordcloud2()
# plot_0305
```
![](305.png)

### 2021-04-12 文字雲
```{r,warning=FALSE,message=FALSE}
# 畫出文字雲
# plot_0412 = word_count %>% 
#   filter(!(word %in% c("缺水","水庫","下雨","台灣"))) %>%
#   filter(artDate == as.Date('2021-04-12')) %>% 
#   select(word,count) %>% 
#   group_by(word) %>% 
#   summarise(count = sum(count)) %>%
#   arrange(desc(count)) %>%
#   filter(count>20) %>%   # 過濾出現太少次的字
#   wordcloud2()
# plot_0412
```
![](412.png)

水情嚴峻 黃偉哲：台南水庫用水可撐到7月底
		
```{r}
data_tokens_date <- data_select %>% 
  filter(!(word %in% c("缺水","水庫","下雨","台灣"))) %>% 
  count(artDate, word, sort = TRUE)
data_tokens_date
data_tokens_date$artDate <- data_tokens_date$artDate %>% as.Date("%Y/%m/%d")

plot_merge <- data_tokens_date %>% 
  filter(artDate =="2021-04-12"| 
       artDate == "2021-03-05"| 
       artDate == "2021-04-25" |
        artDate == "2021-03-27")%>%
  group_by(artDate) %>%
  top_n(10,n)%>%
  mutate(word = reorder(word, n))%>%
  ggplot(aes(x= word, y=n)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = NULL) +
  facet_wrap(~artDate, scales="free", ncol = 2) +
  coord_flip()+
  theme(text = element_text(family = "Heiti TC Light"))
plot_merge
```

+ 清淤：水庫存量下降、水情吃緊，卻也意外迎來清淤的好時機，因此水利署加大清淤力道，去年的清淤量達1440萬立方公尺，創下歷史新高紀錄。

+ 超前部署：超前部署，這3年多來對於區域供水，比如把翡翠水庫的水引到新北、石門水庫引到新竹，不然會更加嚴重。


## 3.找出情緒字典代表字

算出所有字詞的詞頻(sentiment_sum)，找出情緒代表字

### 正負情緒代表字
```{r}
# sentiment_sum:word,sentiment,sum
sentiment_sum <- 
  word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 
  
sentiment_sum %>%
  top_n(30,wt = sum) %>%
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

另外一種呈現方式

### 正負情緒文字雲
```{r}
# sentiment_sum %>%
#   acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
#   comparison.cloud(
#     colors = c("salmon", "#72bcd4"), # positive negative
#                    max.words = 50,family = "Heiti TC Light")
```
![](正負情緒文字雲.png)

另外，也可以依據不同日期觀察情緒代表字的變化

### 2021-04-12 正負情緒代表字
```{r}
sentiment_sum_select <- 
word_count %>%
  filter(artDate == as.Date('2021-04-12')) %>% 
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 

sentiment_sum_select   %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment 0412",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```


### 2021-04-12 正負情緒文字雲
```{r}
# sentiment_sum_select %>%
#   acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
#   comparison.cloud(
#     colors = c("salmon", "#72bcd4"), # positive negative
#                    max.words = 50,family = "Heiti TC Light")
```

![](2021-04-12正負情緒文字雲.png)


## 4.歸類正負面文章

之前的情緒分析大部分是全部的詞彙加總，接下來將正負面情緒的文章分開，看看能不能發現一些新的東西。接下來歸類文章，將每一篇文章正負面情緒的分數算出來，然後大概分類文章屬於正面還是負面。

```{r}
# 依據情緒值的正負比例歸類文章
article_type = 
  data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開，缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame() 
  
# 看一下正負比例的文章各有幾篇
article_type %>%
  group_by(type) %>%
  summarise(count = n())
```


### 正負情緒文章數量統計圖

已缺水事件來說,負面情緒的文章比較多

```{r}
# 
article_type_date = left_join(article_type[,c("artUrl", "type")], MetaData[,c("artUrl", "artDate")], by = "artUrl")


article_type_date %>%
  group_by(artDate,type) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = artDate, y = count, fill = type)) + 
  geom_bar(stat = "identity", position = "dodge")+
  scale_x_date(labels = date_format("%m/%d"),
               limits = as.Date(c('2021-03-01','2021-04-30'))
               )
```

把正面和負面的文章挑出來，並和斷詞結果合併。

```{r}
# negative_article:artUrl,word
negative_article <-
article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
positive_article <-
article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(data_select[,c("artUrl", "word")], by = "artUrl")
```


畫出正負面文章情緒貢獻度較高的關鍵字

### 情緒關鍵字:負面情緒文章
```{r}
# 負面情緒關鍵字貢獻圖
negative_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to negative sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

### 情緒關鍵字:正面情緒文章
```{r}
# 正面情緒關鍵字貢獻圖
positive_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to positive sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

# Ch.3: 各家新聞媒體的情緒分析

+ ETTV：東森
+ UDN：聯合
+ Apple：蘋果

## 1. 資料前處理
```{r}
# 把文章讀進來
ETTV_MetaData = fread('ETTV_articleMetaData.csv',encoding = 'UTF-8')
UDN_MetaData = fread('UDN_articleMetaData.csv',encoding = 'UTF-8')
Apple_MetaData = fread('Apple_articleMetaData.csv',encoding = 'UTF-8')
```

```{r}
# 斷詞結果
ETTV_data <- ETTV_MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
UDN_data <- UDN_MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
Apple_data <- Apple_MetaData %>% unnest_tokens(word, sentence, token=customized_tokenizer)
```

```{r}
# 格式化日期欄位
ETTV_data$artDate= ETTV_data$artDate %>% as.Date("%Y/%m/%d")
UDN_data$artDate= UDN_data$artDate %>% as.Date("%Y/%m/%d")
Apple_data$artDate= Apple_data$artDate %>% as.Date("%Y/%m/%d")
# 過濾特殊字元
ETTV_data_select = ETTV_data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1) 
UDN_data_select = UDN_data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1)
Apple_data_select = Apple_data %>% 
  filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
  filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
  filter(nchar(.$word)>1)

ETTV_word_count <- ETTV_data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))

UDN_word_count <- UDN_data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))

Apple_word_count <- Apple_data_select %>%
  select(artDate,word) %>%
  group_by(artDate,word) %>%
  summarise(count=n()) %>%  # 算字詞單篇總數用summarise
  filter(count>3) %>%  # 過濾出現太少次的字
  arrange(desc(count))
```

## 2.將文章和與LIWC情緒字典做join

在畫出情緒之前，先看看每天的發文情形。

### 各家新聞發文折線圖
```{r}
ETTV_MetaData$artDate= ETTV_MetaData$artDate %>% as.Date("%Y/%m/%d")
UDN_MetaData$artDate= UDN_MetaData$artDate %>% as.Date("%Y/%m/%d")
Apple_MetaData$artDate= Apple_MetaData$artDate %>% as.Date("%Y/%m/%d")

ETTV_Post <- ETTV_MetaData %>% group_by(artDate) %>% summarise(count = n())
UDN_Post <- UDN_MetaData %>% group_by(artDate) %>% summarise(count = n())
Apple_Post <- Apple_MetaData %>% group_by(artDate) %>% summarise(count = n())


ggplot()+
  geom_line(data = ETTV_Post, aes(x=artDate,y=count,colour = "聯合新聞網"))+
  scale_x_date(labels = date_format("%m/%d"))+
  geom_line(data = UDN_Post, aes(x=artDate,y=count,colour ="東森新聞網"))+
  scale_x_date(labels = date_format("%m/%d"))+
  geom_line(data = Apple_Post, aes(x=artDate,y=count,colour ="蘋果新聞網"))+
  scale_x_date(labels = date_format("%m/%d"))+
  scale_colour_manual("",values = c("聯合新聞網" = "red","東森新聞網" = "blue", "蘋果新聞網" = "black"))+
  theme(text=element_text(size=14,family = "Heiti TC Light"))
```

## 3.將三個新聞網的資料合併做比較

```{r}
ETTV_data <- ETTV_data %>% mutate(source = 'ETTV')
UDN_data <- UDN_data %>% mutate(source = 'UDN')
Apple_data <- Apple_data %>% mutate(source = 'Apple')

data_combine = rbind(ETTV_data,UDN_data, Apple_data)
data_combine$artDate= data_combine$artDate %>% as.Date("%Y/%m/%d")
```

```{r}
data_combine %>%
  inner_join(LIWC) %>%
  group_by(artDate,sentiment,source) %>%
  summarise(count = n()) %>%
  filter(artDate>='2021-03-01') %>%
  
  # 畫圖的部分
  ggplot(aes(x= artDate,y=count,fill=sentiment)) +
  scale_color_manual() +
  geom_col(position="dodge") + 
  scale_x_date(labels = date_format("%m/%d")) +
  labs(title = "sentiment of ptt & dcard",color = "情緒類別") +
  facet_wrap(~source, ncol = 1, scales="free_y")  # scale可以調整比例尺
```


可以看出：

+ 只有聯合新聞是有連續每天的在跟進水情的新聞
+ 聯合新聞的情緒詞出現的較多
+ 後期的報導都以正面情緒居多

```{r}
ETTV_sentiment_sum <- 
  ETTV_word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 

UDN_sentiment_sum <- 
  UDN_word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame()

Apple_sentiment_sum <- 
  Apple_word_count %>%
    inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = sum(count)
  ) %>% 
  arrange(desc(sum)) %>%
  data.frame() 
```
## 4.算出所有字詞詞頻後，各新聞網最常出現的情緒代表字

```{r}
ETTV_sentiment_sum %>%
  top_n(30,wt = sum) %>%
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "ETTV Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

```{r}
UDN_sentiment_sum %>%
  top_n(30,wt = sum) %>%
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "UDN Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

```{r}
Apple_sentiment_sum %>%
  top_n(30,wt = sum) %>%
  mutate(word = reorder(word, sum)) %>%
  filter(!(word %in% c("作品"))) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Apple Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

可以看出東森新聞的報導情緒詞較少，較為中立客觀。

而聯合新聞中的情緒詞則非常豐富，同時出現了很多不常在新聞報導中看到的詞彙像是“逗趣”。

## 5.歸類正負面文章

```{r}
ETTV_article_type = 
  ETTV_data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開，缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame()

UDN_article_type = 
  UDN_data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開，缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame() 

Apple_article_type = 
  Apple_data_select %>%
  inner_join(LIWC) %>% 
  group_by(artUrl,sentiment) %>%
  summarise(count=n()) %>%
  spread(sentiment,count,fill = 0) %>% #把正負面情緒展開，缺值補0
  mutate(type = case_when(positive > negative ~ "positive", 
                             TRUE ~ "negative")) %>%
  data.frame() 
```

```{r}
# negative_article:artUrl,word
ETTV_negative_article <-
ETTV_article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(ETTV_data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
ETTV_positive_article <-
ETTV_article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(ETTV_data_select[,c("artUrl", "word")], by = "artUrl")

# negative_article:artUrl,word
UDN_negative_article <-
UDN_article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(UDN_data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
UDN_positive_article <-
UDN_article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(UDN_data_select[,c("artUrl", "word")], by = "artUrl")

# negative_article:artUrl,word
Apple_negative_article <-
Apple_article_type %>%
  filter(type=="negative")%>%
  select(artUrl) %>%
  left_join(Apple_data_select[,c("artUrl", "word")], by = "artUrl")

# positive_article:artUrl,word
Apple_positive_article <-
Apple_article_type %>%
  filter(type=="positive")%>%
  select(artUrl) %>%
  left_join(Apple_data_select[,c("artUrl", "word")], by = "artUrl")
```

### 聯合新聞的正負面文章

因為聯合新聞的正負詞彙最為豐富，所以我們特別來看一下聯合新聞的正負面文章中主要出現的情緒詞

```{r}
# 負面情緒關鍵字貢獻圖
UDN_negative_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(30,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "UDN Contribution to negative sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```
```{r}
# 正面情緒關鍵字貢獻圖
UDN_positive_article %>%
inner_join(LIWC) %>%
    group_by(word,sentiment) %>%
  summarise(
    sum = n()
    )%>% 
  arrange(desc(sum)) %>%
  data.frame() %>%
  top_n(20,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "UDN Contribution to positive sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = "Heiti TC Light"))+
  coord_flip()
```

# Ch.4: 各個縣市對於水情關注的重點

## 1.ngram 
前後五個字彙
```{r}
ngram_11 <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    ngram <- ngrams(tokens, 11)
    ngram <- lapply(ngram, paste, collapse = " ")
    unlist(ngram)
  })
}
```

```{r}
water_ngram_11 <- MetaData %>%
  select(artUrl, sentence) %>%
  unnest_tokens(ngram, sentence, token = ngram_11) %>%
  filter(!str_detect(ngram, regex("[0-9a-zA-Z]")))
water_ngrams_11_separated <- water_ngram_11 %>%
  separate(ngram, paste0("word", c(1:11),sep=""), sep = " ")
head(water_ngrams_11_separated)
```

## 2.查看出常出現在「缺水」附近的字。
```{r}
# 查看缺水附近的詞彙
water_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "缺水")
#water_check_words
```
```{r}
water_check_words_count <- water_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word=value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","嚴重","解決"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>%
  count(word, sort = TRUE)

water_check_words_count %>%
  top_n(10,n) %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("出現在「缺水」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light"))

# plot_merge <- data_tokens_date %>% 
#   filter(artDate == "2021-03-26"| 
#        artDate == "2021-03-22" | 
#        artDate == "2021-04-15" |
#        artDate == "2021-03-05") %>% 
#   group_by(artDate) %>% 
#   top_n(10,n) %>% 
#   ungroup() %>%
#   ggplot(aes(x= reorder(word, +n), y=n, fill = artDate)) +
#   geom_col(show.legend = FALSE) +
#   labs(x = NULL, y = NULL) +
#   facet_wrap(~artDate, scales="free", ncol = 2) +
#   coord_flip()+
#   theme(text = element_text(family = "Heiti TC Light"))

```

缺水周圍出現最多的是：缺電、各個地區、乾旱、危機

## 3.出現在各個縣市周圍的詞彙

台南、台中、台北、宜蘭、中南部
```{r}
# 查看缺水附近的詞彙
TN_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "台南")
head(TN_check_words)

TN_check_words_plot <- TN_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","台南"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>% 
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#999999") +
  xlab("「台南」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))
#TN_check_words_plot
```
```{r}
TC_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "台中") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","台中"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE,fill="#E69F00") +
  xlab("「台中」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))

#TC_check_words_plot
```
```{r}
TP_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "台北") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","台北"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#0072B2") +
  xlab("「台北」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))

#TP_check_words_plot
```
```{r}
YL_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "宜蘭") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","宜蘭"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(5,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE,fill="#D55E00") +
  xlab("「宜蘭」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))

#YL_check_words_plot
```
```{r}
# 中南部
CN_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "中南部") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","中南部"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(7,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE,fill="#009E73") +
  xlab("「中南部」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))

```
```{r}
# 高雄
KH_check_words_plot <- water_ngrams_11_separated %>% 
  filter(word6 == "高雄") %>% 
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","高雄"))) %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE)  %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE,fill="#CC79A7") +
  xlab("「高雄」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))


```

```{r}
# 合併多圖的function
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
 
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
 
  numPlots = length(plots)
 
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }
 
 if (numPlots==1) {
    print(plots[[1]])
 
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
 
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
 
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
```

```{r}
# The palette with grey:
# cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# 合併所有location的圖
multiplot(TN_check_words_plot, TC_check_words_plot, TP_check_words_plot,
          YL_check_words_plot, KH_check_words_plot, CN_check_words_plot, cols=2)
```

+ 台南、高雄:高雄支援台南用水2/25之後轉黃燈 所以暫停支援

+ 台中:台中、苗栗、新竹水情嚴種 4/6起停5供2

+ 宜蘭:台積電為解決缺水，傳出設廠宜蘭的消息

## 4.其他一些較常出現的詞彙

台積電、民生、農業
```{r}
# 查看「台積電」附近的詞彙
TSMC_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "台積電")
head(TSMC_check_words)

TSMC_check_words_plot <- TSMC_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  # filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","缺電"))) %>% 
  filter(!(word %in% c("台灣", "台積電"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>% 
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#999999") +
  xlab("「台積電」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))
TSMC_check_words_plot
```
```{r}
# 查看「民生」附近的詞彙
MS_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "民生")
head(MS_check_words)

MS_check_words_plot <- MS_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  # filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","缺電"))) %>% 
  filter(!(word %in% c("台灣"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>% 
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#999999") +
  xlab("「民生」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))
MS_check_words_plot
```
```{r}
# 查看「農業」附近的詞彙
AG_check_words <- water_ngrams_11_separated %>%
  filter(word6 == "農業")
head(AG_check_words)

AG_check_words_plot <- AG_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:11),sep="")) %>%
  rename(word = value) %>%
  filter(variable!="word6") %>%
  # filter(!(word %in% c("台灣","缺水","下雨","水庫","問題","缺電"))) %>% 
  filter(!(word %in% c("台灣"))) %>% 
  filter(!(word %in% stop_words), nchar(word)>1) %>% ###
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(8,n) %>% 
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE, fill="#999999") +
  xlab("「農業」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light",size=10))
AG_check_words_plot
```

+ 台積電: 台灣易缺水缺電擬移設備至南京廠

+ 農業: 農業和半導體為兩大主要用水，缺水問題影響兩大產業

# Ch.5: 主題模型的分析

## 建立LDA模型
統計每篇文章詞頻
```{r}
water_tokens <- rbind(MToken[,c("artDate", "word","artTitle")],RToken[,c("artDate","word","artTitle")]) 

# 這邊要去掉停用字，以及自建的辭典
water_artid <- water_tokens %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>% 
  count(artTitle, word) %>% 
  rename(count=n) %>% 
  mutate(artId = group_indices(., artTitle))
head(water_artid)
```


```{r}
reserved_word <- water_artid %>% 
  group_by(word) %>% 
  count() %>% 
  filter(n > 5) %>% 
  unlist()

water_artid <- water_artid %>% 
  filter(word %in% reserved_word)
```

```{r}
# 轉換為DTM
water_com_dtm <- water_artid %>% 
  cast_dtm(artId, word, count)
water_com_dtm
```
```{r}
library(LDAvis)
library(topicmodels)
# 轉為分成兩群的LDA
water_lda <- LDA(water_com_dtm, k = 2, control = list(seed = 1234))

two_topics <- tidy(water_lda, matrix = "beta")
head(two_topics)
```
```{r}
# 看分出來的兩個topic中，最常出現的詞
top_terms <- two_topics %>%
  filter(!(term %in% c("台灣"))) %>% 
  filter(!(term %in% stop_words), nchar(term)>1) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +  # 畫圖的部分
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  theme(text = element_text(family = "Heiti TC Light"))

top_terms

```

兩主題之間相差最大的詞彙

正越大表示越傾向主題二，負越大越傾向主題一，

```{r}
beta_spread <- two_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .0004 | topic2 > .0004 ) %>%
  mutate(log_ratio = log2(topic2 / topic1))

topic_ratio <- rbind(beta_spread %>% 
                         top_n(10,wt = log_ratio), 
                       beta_spread %>% 
                         top_n(-10, log_ratio)) %>%
  arrange(log_ratio)

topic_ratio %>% 
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
  geom_bar(stat="identity") + 
  xlab("Word")+
  coord_flip() +
  theme(text = element_text(family = "Heiti TC Light"))
```

LDAvis

只分為兩個主題出來的結果並不是很明確，這裡改成分為三個主題。
```{r}
topicmodels_json_ldavis <- function(fitted, doc_term){
    require(LDAvis)
    require(slam)
    phi <- as.matrix(posterior(fitted)$terms)
    theta <- as.matrix(posterior(fitted)$topics)
    vocab <- colnames(phi)
    term_freq <- slam::col_sums(doc_term)
    json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                            vocab = vocab,
                            doc.length = as.vector(table(doc_term$i)),
                            term.frequency = term_freq)
    return(json_lda)
}
```

```{r}
# water_ldavis <- LDA(water_com_dtm, k = 3, control = list(seed = 1234))
# json_res <- topicmodels_json_ldavis(water_ldavis,water_com_dtm)
# serVis(json_res, open.browser = T)
```
![](topic1.png)
![](topic2.png)
![](topic3.png)


# Ch.6：其他

我們分析了Twitter上有關 #Taiwan 和 #drought 作為關鍵字的貼文

![](twitter.png)


發現大家關注的焦點主要是在，晶片和半導體

# Ch.7：結論

+ 大家的討論主要還是負面情緒居多，負面情緒的來源主要是希望政府可以對於缺水的情況有更多的作為，以及對於未來水情的擔憂。
+ 各個縣市對於缺水狀況關注的重點，各有不同。
+ 全球對於台灣缺水狀況的反映，主要是對於全球晶片產能的擔憂。
