社群媒體分析_期末報告

組別成員：

N064220018 王惠蘭
N064220019 王珮芬
N064220025 黃雅琳
N074020011 李幸珊

前言：

真相可能超乎你的想像！根據健保署公布近年國人使用醫療費用最多之前20大疾病排行榜，並就其中醫界已具共識與空汙高度相關者，2016年有呼吸系統其他疾病、呼吸和胸內器官惡性腫瘤及慢性下呼吸道疾病入榜，費用合計292億元。更甚者，近年國人每年死於肺癌及慢性下呼吸道疾病者已逾1萬5000人，以醫學臨床已知肺癌、慢性下呼吸道與空汙的相關性，若說空汙年奪逾萬國人性命，也不為過！依環保署所提供資料，綜合各種污染物排放濃度計算出空氣品質指數（Air Quality Index, AQI），當AQI超過100時即不適於敏感族群。若以AQI超過100的日數比例為「空氣品質」指標，超標愈多代表空氣品質愈差，若將超標日數增減的比例定義為「改善程度」，增加比例愈高代表空氣品質惡化。

所以在我們的想法，反空汙應是大家的共識！但是，依本次公投統計資料顯示：高雄市的空氣品質是全台最糟，超標日數達3成，「反空污」支持強度卻低於全國平均，在全台（不計金馬）排名第14。支持強度排名全台（不計金馬）第3的花蓮，去年空氣品質其實排名全台第二，僅次於台東。較支持「反燃煤」的縣市，其空氣品質較好，例如花蓮與台東；又如嘉義市的空污是全台第三嚴重，支持強度卻是全台倒數第三。這與中研院社會所研究員吳齊殷透過去年大選前後進行的「台灣社會意向調查」分析結果相似，而他認為實際的空氣品質與民眾的真實感受仍有落差。

因此我們希望透過分析政府數據、報紙、PTT討論等三部分來分析國人對於空汙認知與態度

先安裝需要的packages，並匯入

Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # 避免中文亂碼

packages = c("dplyr", "tidytext", "stringr", "wordcloud2", "ggplot2",'readr',
             'data.table','reshape2','wordcloud','tidyr','scales','ggraph',
             'jiebaR','igraph','NLP','widyr', "topicmodels", "LDAvis",
             "webshot", "htmlwidgets", "tm")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)

library(dplyr)
library(stringr)
require(tidytext)
require(jiebaR)
library(wordcloud2)
require(NLP)
require(data.table)
require(ggplot2)
require(reshape2)
require(wordcloud)
require(tidyr)
require(readr)
require(scales)
require(ggraph)
require(igraph)
require(widyr)
require(topicmodels)
require(LDAvis)
require(webshot)
require(htmlwidgets)
require(tm)

設定資料夾

setwd('C:/Users/XA/Documents/R/Air pollution')
getwd()

導入關鍵字

Key_wd <- c("紫爆","空汙", "空氣汙染","空氣品質","核能", "火力", "用肺發電",
            "缺電", "環保", "過濾", "肺癌", "口罩", "發電", "過敏", "眼睛痛", 
            "霧霾", "塵霾" ,"PM2.5", "懸浮微粒", "油煙","癌症","肺癌", 
            "空氣糟", "空氣好","症狀", "塵蹣", "風力", "台電", "以核養綠",
            "擁核", "反核")

Part I: 台灣電力供需

我們透過台電的公開資訊，抓取台電1982-2018年的各機組發電量

能源需求則是從2000-2018年

#台電發電量表單
Power_Generation <- read.csv('./taipower_gen.csv')
Power_Generation$年度 = as.numeric(Power_Generation$年度)

#能源需求表單
Power_demand <- read.csv('./total_energy_demand.csv')
Power_demand$year = as.numeric(Power_demand$year)

#台電發電量
ggplot_power <- Power_Generation %>%
  ggplot(aes(x=年度, y)) +
  geom_bar(mapping= aes(x=年度,y=(燃油.台電.+燃煤.台電.+燃氣.台電.)/1000000000), stat = "identity", color="blue")+
  geom_bar(mapping= aes(x=年度, y=核能.台電./1000000000), stat = "identity",color="red")+
  geom_vline(xintercept = as.numeric("2011"), col='green') + #日本311地震後
  ylab("發電量(單位:10億)")+
  ggtitle("台電發電量發電量")
ggplot_power

#能源需求
ggplot(Power_demand, aes(x=year, y=合計/100000000)) +
  geom_line(stat = "identity", color= "green")+
  ylab("能源需求(單位:億)")+  
  ggtitle("國內能源需求總量")

資料來源(台灣電力公司)：https://www.taipower.com.tw/tc/page.aspx?mid=96

紅色的是核能，藍色的是燃煤。可以看到台灣的發電結構，在日本311地震後，有顯著的改變，核能逐年遞減，火力則是逐年遞增。

依據ExxonMobil在「Outlook for Energy-A View to 2030」當中的闡述，未來全球能源和環境的情景：「至2030年，由於經濟復甦與成長，加上人口成長和生活標準提高，每年能源需求成長率將向上推升1.2％；至2030年，全球能源需求將比2005年增加近35％，惟前提是假設能源效率有重大的提升。若無，則2030年能源需求將比2005年約高95％」

因此能源需求，不意外的從2000後，急速增加，並且仍在持續增加中。

Part II：空氣品質

#高雄空品
AirQ_108_K <- read.csv('./AriQuality/107年前鎮站_2019.csv')
AirQ_107_K <- read.csv('./AriQuality/106年前鎮站_2018.csv')
AirQ_106_K <- read.csv('./AriQuality/105年前鎮站_2017.csv')
AirQ_105_K <- read.csv('./AriQuality/104年前鎮站_2016.csv')
AirQ_099_K <- read.csv('./AriQuality/099年前鎮站_2010.csv')
AirQ_094_K <- read.csv('./AriQuality/094年前鎮站_2005.csv')
AirQ_089_K <- read.csv('./AriQuality/089年前鎮站_2000.csv')
AirQ_084_K <- read.csv('./AriQuality/084年前鎮站_1995.csv')
AirQ_079_K <- read.csv('./AriQuality/079年鳳山站_1990.csv')

#台中空品
AirQ_108_T <- read.csv('./AriQuality/107年沙鹿站_2019.csv')
AirQ_107_T <- read.csv('./AriQuality/106年沙鹿站_2018.csv')
AirQ_106_T <- read.csv('./AriQuality/105年沙鹿站_2017.csv')
AirQ_105_T <- read.csv('./AriQuality/104年沙鹿站_2016.csv')
AirQ_099_T <- read.csv('./AriQuality/099年沙鹿站_2010.csv')
AirQ_094_T <- read.csv('./AriQuality/094年沙鹿站_2005.csv')
AirQ_089_T <- read.csv('./AriQuality/089年沙鹿站_2000.csv')
AirQ_084_T <- read.csv('./AriQuality/084年沙鹿站_1995.csv')
AirQ_079_T <- read.csv('./AriQuality/079年台中站_1990.csv')

#花蓮空品
AirQ_108_H <- read.csv('./AriQuality/107年花蓮站_2019.csv')
AirQ_107_H <- read.csv('./AriQuality/106年花蓮站_2018.csv')
AirQ_106_H <- read.csv('./AriQuality/105年花蓮站_2017.csv')
AirQ_105_H <- read.csv('./AriQuality/104年花蓮站_2016.csv')
AirQ_099_H <- read.csv('./AriQuality/099年花蓮站_2010.csv')
AirQ_094_H <- read.csv('./AriQuality/094年花蓮站_2005.csv')
AirQ_089_H <- read.csv('./AriQuality/089年花蓮站_2000.csv')
AirQ_084_H <- read.csv('./AriQuality/084年花蓮站_1995.csv')
AirQ_079_H <- read.csv('./AriQuality/079年花蓮站_1900.csv')

AirQ_K <- rbind(AirQ_108_K ,AirQ_107_K, AirQ_106_K, AirQ_105_K,
                AirQ_099_K, AirQ_094_K, AirQ_089_K, AirQ_084_K, AirQ_079_K) %>%
  select("日期", "測站", "測項", "Aver")
AirQ_K$日期 = as.Date(AirQ_K$日期)

AirQ_T <- rbind(AirQ_108_T ,AirQ_107_T, AirQ_106_T, AirQ_105_T,
                AirQ_099_T, AirQ_094_T, AirQ_089_T, AirQ_084_T, AirQ_079_T) %>%
  select("日期", "測站", "測項", "Aver")
AirQ_T$日期 = as.Date(AirQ_T$日期)

AirQ_H <- rbind(AirQ_108_H ,AirQ_107_H, AirQ_106_H, AirQ_105_H,
                AirQ_099_H, AirQ_094_H, AirQ_089_H, AirQ_084_H, AirQ_079_H) %>%
  select("日期", "測站", "測項", "Aver")
AirQ_H$日期 = as.Date(AirQ_H$日期)

資料來源(環保署):https://taqm.epa.gov.tw/taqm/tw/b0201.aspx

資料來源(環保署):https://taqm.epa.gov.tw/taqm/tw/YearlyDataDownload.aspx

分析台中(沙鹿), 高雄(前鎮)空氣品質, 並以花蓮做為基準

#針對PM2.5指標，增加AQI(空氣品質指標)
AirQ_PM2.5_total <- rbind(AirQ_K, AirQ_T, AirQ_H) %>%
  filter ( 測項 == "PM2.5") %>%
  #當每天平均值低於35.4，就為Good的；35.5-54.4為Bad；54.5以上則是Dangerous
  mutate(AQI = ifelse(Aver<35.4, "Good",
                      ifelse(Aver<54.4, "Bad", "Dangerous")))

#由於PM2.5是近10年來才加入偵測監控的值，因此我們額外對PM10製作AQI(空氣品質指標)
AirQ_PM10_total <- rbind(AirQ_K, AirQ_T, AirQ_H) %>%
  filter ( 測項 == "PM10") %>%
  #當每天平均值低於125，就為Good的；126-254為Bad；255上則是Dangerous
  mutate(AQI = ifelse(Aver<125, "Good",
                      ifelse(Aver<254, "Bad", "Dangerous")))

#計算平均值
StatMeanLine <- ggproto("StatMeanLine", Stat,
                        compute_group = function(data, scales) {
                          transform(data, yintercept=mean(y))
                        },
                        required_aes = c("x", "y"))

stat_mean_line <- function(mapping = NULL, data = NULL, geom = "hline",
                           position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    stat = StatMeanLine, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

AirQ_ggplot <- AirQ_PM2.5_total %>%
  filter(日期> "2015-01-01")%>%
  ggplot(aes(x=日期, y=Aver, color= 測站))+
  geom_line(stat = "identity")+
  stat_mean_line(color="red") +
  facet_wrap(~ 測站, ncol = 1, scales = "fixed")+
  scale_x_date(date_breaks = "3 month")+
  scale_y_continuous(name="Aver PM 2.5", limits=c(0, 100))+
  theme(axis.text.x = element_text(angle=90, hjust = 1)) # x軸刻度翻轉90度
AirQ_ggplot

從2015年開始，可以看到高雄PM2.5濃度平均約為25μ/m3，台中則是略低於25μ/m3，而花蓮在10μ/m3左右。而南部地區(高雄前鎮)可以看到PM2.5的起伏隨著季節浮動，1月為高峰，7月為最低，與一般認知(季風)大致吻合。

我們再針對各地區AQI達到Bad/Dangerous的天數進行分析

#AQI 空氣品質不良的天數
AirQ_AQI_ggplot <- AirQ_PM2.5_total %>%
  group_by(AQI) %>%
  mutate(year = format(as.Date(日期), "%Y"))%>%
  count(AQI, 測站, year) %>%
  ggplot(aes(x=year, y=n, fill=測站), color=測站)+
  geom_bar(stat = "identity", position="dodge")+
  facet_wrap(~ AQI, ncol = 1, scales = "fixed")+
  xlab("年份")+
  ylab("天數")+
  ggtitle("每年空氣品質天數-PM2.5")+
  theme(axis.text.x = element_text(angle=90, hjust = 1)) # x軸刻度翻轉90度
AirQ_AQI_ggplot

AirQ_AQI_PM10 <- AirQ_PM10_total %>%
  group_by(AQI) %>%
  mutate(year = format(as.Date(日期), "%Y"))%>%
  count(AQI, 測站, year) %>%
  ggplot(aes(x=year, y=n, fill=測站), color=測站)+
  geom_bar(stat = "identity", position="dodge")+
  facet_wrap(~ AQI, ncol = 1, scales = "fixed")+
  xlab("年份")+
  ylab("天數")+
  ggtitle("每年空氣品質天數-PM10")+
  theme(axis.text.x = element_text(angle=90, hjust = 1)) # x軸刻度翻轉90度
AirQ_AQI_PM10

針對火力發電廠所在區域：高雄(前鎮站)、台中(沙鹿站)，取2005到2018年的資料來統計，統計出每年相關指標達到Good/Bad/Dangerous的天數，並以花蓮做為對比。

可以看到，不論是PM2.5，或是PM10，空氣品質似乎是逐漸改善，看來似乎並不與火力發電場的發電量有直接相關性。這似乎與我們的預期有差異，那麼國人這些年來覺得空汙日益嚴重的輿論到底是怎麼來的呢？

Part III：報紙分析

我們透過中山管院的文字分析平台，尋找2015-2019/04/30，針對“頭版”、“國際”、“生活”、“論壇”四個版來撈相關新聞報導。

抓取的關鍵字是“空氣”、“空汙”、“火力”、“發電”、“霧霾”、“核電”、“核能”。

#先匯入csv檔
txt_News <- fread('./AirPollution_news_articleMetaData.csv', encoding = "UTF-8")%>%
  mutate(sentence=gsub("[\n]{2,}", "。", sentence))
txt_News$artDate = as.Date(txt_News$artDate)

n_distinct(txt_News$artUrl) #文章數量

## [1] 5839

#新聞對比日期
ggplot_news_date <- txt_News %>%
  mutate(month = format(as.Date(artDate), "%Y%m")) %>%
  group_by(month) %>%
  mutate(datecount = n())%>%
  select(month, datecount)%>%
  unique()%>%
  ggplot(aes(x = month, y=datecount)) +
  geom_bar(stat = "identity") +
  xlab("日期")+
  ylab("新聞數")+
  ggtitle("每月新聞數")+
  theme(axis.text.x = element_text(angle=90, hjust = 1)) # x軸刻度翻轉90度
ggplot_news_date

近4年的相關新聞，每月都有100則左右的相關新聞，並不存在近一年較高的情況。最高峰落在從2018-11，估計是與公投案“以核養綠”有關。

另外可能也受到311核災影響，每年3月相關新聞數量也有微幅上升。

接著進行新聞內容斷句分析

#jieba字典
jieba_tokenizer = worker()
jieba_tokenizer <- worker(stop_word = "stop_words.txt") #加入stop_words
new_user_word(jieba_tokenizer, Key_wd) #把key_point加入字典

#新增斷句工具
air_jieba <- function(t) {
  lapply(t, function(x) {
    if(nchar(x)>1){
      tokens <- segment(x, jieba_tokenizer)
      # 去掉字串長度<1的詞彙
      tokens <- tokens[nchar(tokens)>1]
      return(tokens)
    }
  })
}

# 以全形或半形 驚歎號、問號、分號 以及 全形句號 ???依據進行斷句
txt_news_word <- strsplit(txt_News$sentence,"[。！；？!?;]")

# 將每句句子，與他所屬的文章連結配對起來，整理成一個dataframe
txt_news_word <- data.frame(
  artUrl = rep(txt_News$artUrl, sapply(txt_news_word, length)), 
  sentence = unlist(txt_news_word)
) %>%
  filter(!str_detect(sentence, regex("^(\t|\n| )*$")))

txt_news_word$sentence <- as.character(txt_news_word$sentence)

newtokens <- txt_News %>%
  unnest_tokens(word, sentence, token = air_jieba) %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>%
  count(artUrl, word, artDate, artCat) %>%
  rename(count=n)
newtokens

ABCDEFGHIJ0123456789

artUrl <chr>	word <chr>	artDate <date>
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	一旁	2015-01-01
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	一頓飯	2015-01-01
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	一趟	2015-01-01
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	口罩	2015-01-01
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	土地	2015-01-01
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	子女	2015-01-01
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	小時	2015-01-01
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	山間	2015-01-01
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	山路	2015-01-01
http://tw.news.appledaily.com/headline/daily/20150102/36302689/	已經	2015-01-01

新聞文字雲

newtokens %>%
  group_by(word) %>%
  summarise(sum = sum(count)) %>%
  arrange(desc(sum)) %>%
  wordcloud2()

新聞中，談論到最多的詞彙是“台灣”、“空氣”、“政府”、“中國”、“環保”、“日本”、“台電”、“美國”等，除了與能源政策、空氣品質議題較相關之外，比較值得提的是亞洲國家和美國也是大量出現，報紙報導多與國際議題較有關係。

#加入文章識別ID
news_id <- newtokens %>%
  mutate(artId = group_indices(., artUrl))

#對比新聞與關鍵字
news_keyword <- news_id %>%
  filter(word %in% Key_wd) %>%
  group_by(word) %>%
  mutate( newcount = n()) %>%
  count(newcount, artId, word, sort = TRUE)

#關鍵字在新聞的出現頻率
ggplot_news_keyword <- news_keyword %>%
  group_by(word) %>%
  select(word, newcount) %>%
  unique()%>% #移除重複的詞
  filter(newcount>20)%>%
  ungroup() %>% 
  mutate(word = reorder(word, newcount)) %>%
  ggplot(aes(x=word, y=newcount))+
  geom_bar(stat = "identity") +
  xlab("關鍵字")+
  ylab("新聞數")+
  ggtitle("關鍵字出現頻率")+
  coord_flip()
ggplot_news_keyword

環保、發電、火力、空氣品質是新聞的前四大關鍵字，或許也是新聞4年來的推波助瀾下，讓國人對於反核延伸的能源議題有更多的接收管道。

進一步分析新聞文字間的Network

#將關鍵字分為"能源"、"健康"兩個群組
Key_wd_energy <- c("核能", "火力", "發電", "風力", "台電",
                   "以核養綠", "擁核", "反核")
key_wd_health <- c("環保", "過濾", "肺癌", "口罩", "過敏", "眼睛痛",
                   "霧霾", "塵霾" ,"PM2.5", "懸浮微粒", "油煙","癌症","肺癌",
                   "空氣糟", "空氣好","症狀", "塵蹣")

term_cooccurrence = news_id %>%
  filter(word %in% Key_wd) %>%
  pairwise_count(word, artId , sort = TRUE, diag=F) #計算字詞配對數

term_cooccurrence = as.data.frame(term_cooccurrence)

#移除重複的pairwise
for (i in 1:nrow(term_cooccurrence)){
  term_cooccurrence[i, ] = sort(term_cooccurrence[i,])
}
term_cooccurrence=term_cooccurrence[!duplicated(term_cooccurrence),]
names(term_cooccurrence)=c('weight','item1','item2')
term_cooccurrence=term_cooccurrence %>%  select(item1,item2,weight) %>%
  #將item1與兩個群組對比，如果屬於能源相關為紅色；健康相關為藍色；其他則為黃色。
  mutate(color = ifelse(item1 %in% Key_wd_energy, "red",
                    ifelse(item1 %in% key_wd_health, "blue", "yellow")))
term_cooccurrence$weight=as.numeric(term_cooccurrence$weight)

#畫出文字間的網路圖
g = term_cooccurrence %>%
  graph_from_data_frame(directed = F)

V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

#定義node的顏色
V(g)$color=sapply(names(V(g)), function(v){
  ifelse(term_cooccurrence$color == "red", '#FF0000',
         ifelse(term_cooccurrence$color == "blue", '#77FFEE', '#FFFF00'))
})

set.seed(5500)
layout1 <- layout_randomly(g) #設置圖的佈局方式為"隨機布局"
layout2 <- layout.fruchterman.reingold(g) #設置圖的佈局方式為"彈簧式發散"

#以Degree作為頂點大小
deg <- degree(g, mode="all")
plot(g, layout=layout1, vertex.size=deg*1.1, vertex.label.ces=.5)
#頂點大小為degree*1.1
#legent:標示
legend("topleft",  #表示這個標示的位置
       c('health','energy'),
       pch=21, #pch代表點的圖案
       col="#777777",
       pt.bg=c("#77FFEE","#FF0000","#FFFF00"),
       pt.cex=1, cex=0.7)

圈圈即為觸及的議題廣度，越大表示越多議題與該關鍵字有關。這裡可看到“空氣品質”、 “過敏”、“懸浮微粒”、“環保”、“肺癌、”紫爆“等都屬於比較熱門的詞彙。

#以betweenness作為頂點大小
bet <- betweenness(g, directed=F, weights=NA, normalized = T)
plot(g, layout=layout1, vertex.size=bet*500, vertex.label.ces=.4)
legend("topleft", c('health','energy'), pch=21,
       col="#777777", pt.bg=c("#FF0000","#77FFEE","#FFFF00"), pt.cex=1, cex=0.7)

Betweenness為中介度，越高表示該詞為其他詞彙的串接橋梁。以上圖來說，“空氣品質”、“懸浮微粒”、“環保”、“過敏”都屬於中介度高的。

#Closeness接近中心度
clo <- closeness(g, mode="all" , weights=NA, normalized=T)
plot(g, layout=layout2, vertex.size=clo*20)
legend("topleft", c('health','energy'), pch=21,
       col="#777777", pt.bg=c("#FF0000","#77FFEE","#FFFF00"), pt.cex=1, cex=0.8)

最中心的幾個關鍵字，可以理解為核心議題，像是“環保”(紅色)、“紫爆”(藍色)、“懸浮微粒”(紅色)、“癌症”(藍色)、“肺癌”(紅色)、“過敏”(藍色)。

而2018的公投“以核養綠”(紅色)，衍伸出來的“用肺發電”、“核能”、“擁核”、“反核”都圍繞在“以核養綠”附近。

Part IV：Topic Modeling

我們繼續以新聞進行topic分析

news_dtm <-news_id %>%
  cast_dtm(artId, word, count)

inspect(news_dtm[1:10,1:10])

## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 11/89
## Sparsity           : 89%
## Maximal term length: 3
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs 一旁 一頓飯 一趟 口罩 土地 子女 小時 山間 山路 已經
##   1     1      1    1    1    1    1    1    1    1    1
##   10    0      0    0    0    0    0    0    0    0    0
##   2     0      0    0    0    0    0    0    0    0    0
##   3     0      0    0    0    0    0    0    0    0    0
##   4     0      0    0    0    0    0    0    0    0    0
##   5     0      0    0    0    0    0    0    0    0    0
##   6     0      0    0    0    0    0    0    0    0    0
##   7     0      0    0    0    0    0    0    0    0    0
##   8     0      0    0    0    0    0    0    0    0    0
##   9     0      0    0    0   12    0    0    0    0    0

#建立LDA模型，並且分成2個主題
news_lda <- LDA(news_dtm, 
                k = 2, # number of topics
                control = list(seed = 5555))

#查看ϕ matrix (topic * term)
news_topics <- tidy(news_lda, matrix = "beta") 
news_topics

ABCDEFGHIJ0123456789

topic <int>	term <chr>	beta <dbl>
1	一旁	5.711228e-05
2	一旁	6.533902e-05
1	一頓飯	4.009565e-06
2	一頓飯	7.721620e-22
1	一趟	9.491924e-05
2	一趟	7.335615e-06
1	口罩	8.952470e-04
2	口罩	7.380083e-05
1	土地	6.599711e-05
2	土地	4.631927e-04

尋找Topic的代表字

news_top_terms <- news_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

news_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

移除掉中性字，再來分析一次

remove_words <- c("使用", "表示", "報導", "台灣") #移掉中性字
news_top_terms_remove <- news_topics %>%
  filter(! term %in% remove_words) %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

news_top_terms_remove %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

兩種topic中，紅色這群屬與健康相關的關鍵字，包含“過敏”、“醫師”、“空氣”、“生活”M、“環境”等；而藍色的屬於能源相關的議題，包含“政府”、“台電”、“中國”、“美國”、“日本”、“發電”、“電廠”。

Part V：PTT 文章分析

最後，我們以PTT鄉民的討論度來簡單分析青中年(25-40歲)的國人態度。

我們同樣透過中山管院的文字分析平台，搜尋2015-2019/04/30的PTT版上台中版跟高雄版的鄉民文章。

同樣的抓取關鍵字是“空氣”、“空汙”、“火力”、“發電”、“霧霾”、“核電”、“核能”。

#匯入PTT文章csv
txt_ptt <- fread('./Airpollution_articleMetaData.csv', encoding = "UTF-8")%>%
  mutate(sentence=gsub("[\n]{2,}", "。", sentence))
txt_ptt$artDate = as.Date(txt_ptt$artDate)

#LIWC情緒字典
p <- read_file("./positive.txt")
n <- read_file("./negative.txt")
positive <- strsplit(p, "[,]")[[1]]
negative <- strsplit(n, "[,]")[[1]]
positive <- data.frame(word = positive, sentiments = "positive")
negative <- data.frame(word = negative, sentiemtns = "negative")
colnames(negative) = c("word","sentiment")
colnames(positive) = c("word","sentiment")
LIWC_ch <- rbind(positive, negative)

第一部分，文章的情緒統計

#進行斷句
ptttokens <- txt_ptt %>%
  unnest_tokens(word, sentence, token = air_jieba) %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>%
  count(artUrl, word, artDate, artCat, artPoster, push, boo) %>%
  rename(count=n)
ptttokens

ABCDEFGHIJ0123456789

artUrl <chr>	word <chr>	artDate <date>	artCat <chr>
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	一帶	2015-02-05	Kaohsiung
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	今天	2015-02-05	Kaohsiung
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	戶外活動	2015-02-05	Kaohsiung
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	另一側	2015-02-05	Kaohsiung
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	本來	2015-02-05	Kaohsiung
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	地方	2015-02-05	Kaohsiung
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	好吃	2015-02-05	Kaohsiung
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	好玩	2015-02-05	Kaohsiung
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	走走	2015-02-05	Kaohsiung
https://www.ptt.cc/bbs/Kaohsiung/M.1423124931.A.520.html	那算內	2015-02-05	Kaohsiung

#PTT上每月的情緒字出現量 (Positive/Negative)
plot_sent_ptt <- ptttokens %>%
  select(artDate,word ,count) %>%
  inner_join(LIWC_ch) %>%
  mutate(DateYM = format(as.Date(artDate), "%Y/%m"))%>%
  group_by(DateYM ,sentiment) %>%
  summarise(count = sum(count)) %>%
  ggplot(aes(x=DateYM, y = count, color = sentiment))+
  geom_bar(stat = "identity", position="dodge")+
  xlab("Date")+
  theme(axis.text.x = element_text(angle=90, hjust = 1)) # x軸刻度翻轉90度
plot_sent_ptt

情緒最高落在2017-11月左右，這與2017年11月台中舉辦，高雄人高參與的反空汙大遊行，應該有正相關。第二高峰則是2018-11月，應該與“以核養綠”公投有關。

第二部分，計算文字情緒

#計算所有word出現次數
word_count <- ptttokens %>%
  select(word,count) %>%
  group_by(word) %>%
  summarise(count = sum(count))  %>%
  filter(count>3)%>%
  inner_join(LIWC_ch)
word_count

ABCDEFGHIJ0123456789

word <chr>	count <int>	sentiment <fctr>
八卦	15	negative
不好	75	negative
不利	4	negative
不足	20	negative
不爽	5	negative
不當	4	negative
不適	11	negative
不錯	38	positive
允諾	10	positive
公平	9	positive

統計正面/負面情緒最常出現的字

ggplot_word_count <- word_count %>%
  inner_join(LIWC_ch) %>%
  group_by(sentiment) %>%
  top_n(15,wt = count) %>%
  ungroup() %>%
  mutate(word = reorder(word, count)) %>%
  ggplot(aes(word, count, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14))+
  coord_flip()
ggplot_word_count

正向的字眼有“改善”、“健康”、“乾淨”等；負面的則以“嚴重”、“惡化”、“敏感”等，可發現鄉民關心的議題多與健康較相關，與能源議題較無直接關係。

延續上面的統計，做出情緒文字雲

sent_cloud <- ptttokens %>%
  inner_join(LIWC_ch) %>%
  group_by(word,sentiment) %>%
  summarise(count=sum(count)) %>%
  acast(word ~ sentiment, value.var = "count", fill = 0) %>%
  comparison.cloud(colors = c("gray80", "gray20"),
                   max.words = 100)

第三部分，分析高雄/台中版鄉民分別關心甚麼關鍵字

#高雄
city_ptt_K <- ptttokens %>%
  filter(artCat == "Kaohsiung") %>%
  group_by(word) %>%
  filter(!word %in% c("高雄", "高雄市"))%>% #把主題去掉，避免佔用排序
  summarise(sum = sum(count)) %>%
  arrange(desc(sum)) %>%
  mutate(word = reorder(word, sum)) %>%
  top_n(10)%>%
  ggplot(aes(word, sum)) +
  geom_col(show.legend = FALSE) +
  coord_flip()
city_ptt_K

####台中
city_ptt_T <- ptttokens %>%
  filter(artCat == "TaichungBun") %>%
  group_by(word) %>%
  filter(!word %in% c("台中", "台中市"))%>% #把主題去掉，避免佔用排序
  summarise(sum = sum(count)) %>%
  arrange(desc(sum)) %>%
  mutate(word = reorder(word, sum)) %>%
  top_n(10)%>%
  ggplot(aes(word, sum)) +
  geom_col(show.legend = FALSE) +
  coord_flip()
city_ptt_T

兩地鄉民其實討論的關鍵字大同小異，比較有趣的是高雄“工廠”；台中“紫爆”、“清淨機”。這或許可說明近年來對於空汙關心態度，中部是高於南部的。

第四部分，Word Correlation (文字相關性)

ptt_word_cors <- ptttokens %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, artUrl, sort = TRUE)

ptt_word_cors_filter <- ptt_word_cors %>%
  #我們選擇幾個主題來尋找各自的相關字
  filter(item1 %in% c("核能", "火力", "環保", "能源", "空汙")) %>%
  group_by(item1) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation))%>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1,  ncol = 2, scales = "free") +
  coord_flip()+
  #加入中文字型設定，避免中文字顯示錯誤。
  theme(text = element_text(family = "Heiti TC Light"))
ptt_word_cors_filter

火力：其實比較單純，都圍繞在火力發電廠相關詞彙。

空汙：比較特別的是，提到“政府”、“市府”，顯見空汙議題不單是台中、高雄兩地的市府單位相關，政府政策也是與空汙息息相關。

能源：跟政府能源政策較有關，因此像是“中央”、“條例”、“發電”、“政策”等文字較相關。

環保：比較單純，文字都跟環保署監測有關。

值得一提的是“核能”竟然完全沒有相關字，我們後來又試了“核電”、“核”、“以核養綠”也都沒有辦法做出結果。

結論：

空污問題不是單純的空氣污染而已，還延伸出空污對健康造成危害的問題、影響生活品質等問題。然而空污具有高度的複雜性，目前的科學能力尚無法訂定出絕對安全的曝露劑量，而且對空污的相關科學知識掌握有限、以及空污所影響的利害相關人範圍更為廣大，橫跨的議題面向也多元，因此需要跨領域互動、協商。在這種情況下，當代的環境治理越來越重視發展更包容多元的決策模式，讓更多常民知識進場，打開科學知識與風險決策的黑盒子。臺灣空汙的惡化已促使一些在地社區不願再忍受空汙的侵襲，從地方面對的環境困境與問題意識出發，組織資訊進行倡議行動。針對傳統政府資訊無法對焦回應問題的系統性缺漏，民間團體尋求更多環境資訊的生產，傳播環境風險意識，因而產生一連串公民參與空汙監測行動。我們的分析來源顯示，這些行動所植基的提問，不僅直接挑戰官方監測數據的指標意義，也透過簡單但涵蓋感官與數據的複合資訊，如「高雄的天空」拍攝計畫與嘉義「玉山觀測」，向大眾重新詮釋空汙問題。

另一方面，政府所發布的大尺度監測數據抹平了環境差異，而無法反映地方在生活尺度上的真實空汙狀況，因而產生需要更細膩資訊的需目前發展的公民監測空汙的社區行動科學，還處於提出質問、整理資訊與傳播風險意識的階段。所生產的資訊較能提供個人避險之用，而無法回答空氣汙染物質特性、也無法釐清誰是主要汙染貢獻者等問題，因此行動上難以轉化為針對源頭系統性變革的訴求。但是，透過資訊生產彌補居民與專家間的知識落差，也確保居民可以持續表態同意與否的權利，使地方居民在充分知情下做出選擇。

高雄作為台灣重要的工業重鎮發展多年，四十多年來大型產業如鋼鐵業、電力業、石化產業不斷發展。工業所帶來的各樣廢棄物與污染問題，也越來越嚴重。高雄的空氣品質因為臭氧及懸浮微粒濃度高，空氣污染成為高雄人無法迴避的課題。進而衍生出的各種環境問題除了影響國家形象與社會觀感外，也會對人體的健康有相當程度的危害。因此要能如何有效地從源頭管制污染源與做好各項污染防制工作，進而加速改善空氣污染乃是當務之急，倘若能配合適當的行政管制措施將更能彌補技術落實上的盲點，相信更能有助於達到改善台灣空氣污染的目標。

本文期許，台灣的空氣品質治理，能透過銜接政策與民眾感知，掌握跨界風險問題點，以實證科學、經濟與政治到社會與文化架構，處理的不只是技術面，而是台灣社會的產業、能源、交通運輸與生活方式的轉型。提升台灣空氣品質，真正守護、捍衛台灣民眾的健康權，呼吸健康空氣。

社群媒體分析_期末報告

Group_7

2019/06/11

組別成員：

前言：

先安裝需要的packages，並匯入

設定資料夾

導入關鍵字

Part I: 台灣電力供需

我們透過台電的公開資訊，抓取台電1982-2018年的各機組發電量

能源需求則是從2000-2018年

Part II：空氣品質

分析台中(沙鹿), 高雄(前鎮)空氣品質, 並以花蓮做為基準

我們再針對各地區AQI達到Bad/Dangerous的天數進行分析

Part III：報紙分析

我們透過中山管院的文字分析平台，尋找2015-2019/04/30，針對“頭版”、“國際”、“生活”、“論壇”四個版來撈相關新聞報導。

抓取的關鍵字是“空氣”、“空汙”、“火力”、“發電”、“霧霾”、“核電”、“核能”。

接著進行新聞內容斷句分析

新聞文字雲

進一步分析新聞文字間的Network

Part IV：Topic Modeling

我們繼續以新聞進行topic分析

尋找Topic的代表字

移除掉中性字，再來分析一次

Part V：PTT 文章分析

最後，我們以PTT鄉民的討論度來簡單分析青中年(25-40歲)的國人態度。

我們同樣透過中山管院的文字分析平台，搜尋2015-2019/04/30的PTT版上台中版跟高雄版的鄉民文章。

同樣的抓取關鍵字是“空氣”、“空汙”、“火力”、“發電”、“霧霾”、“核電”、“核能”。

第一部分，文章的情緒統計

第二部分，計算文字情緒

統計正面/負面情緒最常出現的字

延續上面的統計，做出情緒文字雲

第三部分，分析高雄/台中版鄉民分別關心甚麼關鍵字

第四部分，Word Correlation (文字相關性)

結論：