Ch.0 : 資料取得及觀念複習

1. 資料取得及套件載入

載入的資料是由中山大學管理學院文字分析平台取得，在平台資料輸出區塊選擇「文章+詞彙+詞頻」選項，即可取得相同格式之csv檔案。

資料簡介

本資料為2019/01/01 ~ 2019/03/04 PTT八卦版之資料，透過文字分析平台檢索「還願」、「美心」兩個關鍵字，共搜尋到683篇文章。

安裝需要的packages

packages = c("dplyr","ggplot2", "data.table", "scales")
#要安裝的packages
existing = as.character(installed.packages()[,1])
#將所有已安裝的packages叫出
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
#把尚未安裝的進行安裝

a <- c(1:10)
b <- seq(1,10,2)
a %in% b

##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

a[a %in% b]

## [1] 1 3 5 7 9

require(ggplot2)
require(dplyr)
require(data.table)
require(scales)

csv <- fread("./data/social_media_0305_artWordFreq.csv", encoding = "UTF-8")

head(csv)

##                                  artTitle    artDate  artTime
## 1: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019/01/03 14:50:55
## 2: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019/01/03 14:50:55
## 3: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019/01/03 14:50:55
## 4: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019/01/03 14:50:55
## 5: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019/01/03 14:50:55
## 6: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019/01/03 14:50:55
##                                                      artUrl   word count
## 1: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html 四面佛     7
## 2: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html   還願     4
## 3: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html 越來越     2
## 4: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html   舞者     2
## 5: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html   新莊     2
## 6: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html   妹子     1

str(csv)

## Classes 'data.table' and 'data.frame':   34501 obs. of  6 variables:
##  $ artTitle: chr  "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" ...
##  $ artDate : chr  "2019/01/03" "2019/01/03" "2019/01/03" "2019/01/03" ...
##  $ artTime : chr  "14:50:55" "14:50:55" "14:50:55" "14:50:55" ...
##  $ artUrl  : chr  "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" ...
##  $ word    : chr  "四面佛" "還願" "越來越" "舞者" ...
##  $ count   : int  7 4 2 2 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

#維度34501*6，6個變數皆屬於文字

csv$artDate = csv$artDate %>% as.Date("%Y/%m/%d")
#將時間變數屬性由文字轉成時間
str(csv)

## Classes 'data.table' and 'data.frame':   34501 obs. of  6 variables:
##  $ artTitle: chr  "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" ...
##  $ artDate : Date, format: "2019-01-03" "2019-01-03" ...
##  $ artTime : chr  "14:50:55" "14:50:55" "14:50:55" "14:50:55" ...
##  $ artUrl  : chr  "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" ...
##  $ word    : chr  "四面佛" "還願" "越來越" "舞者" ...
##  $ count   : int  7 4 2 2 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

資料預設之日期欄位是“chr”格式，在畫圖前我們須先將其轉為“date”格式，轉換後可以透過str指令來確認欄位型態。

head(csv)

##                                  artTitle    artDate  artTime
## 1: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019-01-03 14:50:55
## 2: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019-01-03 14:50:55
## 3: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019-01-03 14:50:55
## 4: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019-01-03 14:50:55
## 5: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019-01-03 14:50:55
## 6: [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019-01-03 14:50:55
##                                                      artUrl   word count
## 1: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html 四面佛     7
## 2: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html   還願     4
## 3: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html 越來越     2
## 4: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html   舞者     2
## 5: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html   新莊     2
## 6: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html   妹子     1

資料欄位

artTitle: 文章之標題，注意:不同文章可能會有完全相同的標題。

artDate: 文章發佈之日期。

artTime: 文章發佈之時間。

artUrl: 文章之網址，每篇文章之網址為獨一無二的，可用來辨識相同標題之不同文章。

word: 詞彙。

count: 詞頻。

Ch.1 日期折線圖

這個章節的目的是計算出每一天文章的發表數量，可以看出特定主題討論的熱度。

資料處理

csv %>% select(artUrl) %>% distinct() %>% dim()

## [1] 683   1

csv[,artUrl,artDate] %>% distinct() %>% dim()

## [1] 683   2

data <- csv %>% 
  select(artDate, artUrl) %>% 
  distinct()
data %>% dim()

## [1] 683   2

由於這份資料的每一列是特定文章的每一個詞彙，我們只需要文章以及日期兩個欄位即可，其他重複欄位可以去除。(一篇文章有很多個詞彙，所以會有很多列，但我們只需要保留一個URL即可)。

str(data)

## Classes 'data.table' and 'data.frame':   683 obs. of  2 variables:
##  $ artDate: Date, format: "2019-01-03" "2019-01-13" ...
##  $ artUrl : chr  "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1547383858.A.D14.html" "https://www.ptt.cc/bbs/Gossiping/M.1547387269.A.3A7.html" "https://www.ptt.cc/bbs/Gossiping/M.1547391016.A.E55.html" ...
##  - attr(*, ".internal.selfref")=<externalptr>

article_count_by_date <- data %>% 
  group_by(artDate) %>% 
  summarise(count = n())

head(article_count_by_date, 20)

## # A tibble: 20 x 2
##    artDate    count
##    <date>     <int>
##  1 2019-01-03     1
##  2 2019-01-13     3
##  3 2019-01-14     2
##  4 2019-01-15     1
##  5 2019-01-18     1
##  6 2019-01-20     1
##  7 2019-02-01     1
##  8 2019-02-10     1
##  9 2019-02-12     2
## 10 2019-02-17     2
## 11 2019-02-18     6
## 12 2019-02-19    79
## 13 2019-02-20    33
## 14 2019-02-21    24
## 15 2019-02-22    65
## 16 2019-02-23   157
## 17 2019-02-24   126
## 18 2019-02-25    91
## 19 2019-02-26    36
## 20 2019-02-27    20

按照日期分群，計算每天共有幾篇討論文章。

  article_count_by_date %>% 
  # aesthetics
  ggplot(aes(x = artDate, y = count)) +
  # geometrics
  geom_step(color = "#00AFBB", size = 0.5)

article_count_by_date$artDate <- article_count_by_date$artDate %>% as.Date(format="%Y/%m/%d")
str(article_count_by_date$artDate)

##  Date[1:23], format: "2019-01-03" "2019-01-13" "2019-01-14" "2019-01-15" "2019-01-18" ...

plot_date <- 
  # data
  article_count_by_date %>% 
  # aesthetics
  ggplot(aes(x = artDate, y = count)) +
  # geometrics
  geom_line(color = "#00AFBB", size = 2) + 
  geom_vline(xintercept = as.numeric(as.Date("2019-02-15")), col='red') + 
  #在時間2019-02-15加上紅線
  ggtitle("還願 討論文章數") + 
  xlab("日期") + #y軸名稱
  ylab("數量") + #x軸名稱
  # theme
  theme(text = element_text(family = "Heiti TC Light")) #加入中文字型設定，避免中文字顯示錯誤。

plot_date

> 從上圖中可以看到關於「還願」的討論從02/15開始加溫，對照維基百科簡介可得知此款遊戲於2019/02/19正式上市。

Ch.2 文字雲

接下來我們來大略觀察討論的內容為何，使用的方式為文字雲。

str(csv)

## Classes 'data.table' and 'data.frame':   34501 obs. of  6 variables:
##  $ artTitle: chr  "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" ...
##  $ artDate : Date, format: "2019-01-03" "2019-01-03" ...
##  $ artTime : chr  "14:50:55" "14:50:55" "14:50:55" "14:50:55" ...
##  $ artUrl  : chr  "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" ...
##  $ word    : chr  "四面佛" "還願" "越來越" "舞者" ...
##  $ count   : int  7 4 2 2 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

data <- csv %>% 
  group_by(word) %>% 
  summarise(sum = sum(count)) %>% 
  arrange(desc(sum))

先將資料集中所有文章按照文字進行分群，計算每一個字的總詞頻。

head(data)

## # A tibble: 6 x 2
##   word    sum
##   <chr> <int>
## 1 還願   1367
## 2 遊戲   1280
## 3 中國    587
## 4 台灣    428
## 5 赤燭    417
## 6 網友    306

結果為總詞頻最多的字。

require(wordcloud2)
data %>% wordcloud2()

data %>% wordcloud2(shape = "square",
                    backgroundColor='Lightblue',
                    size = 0.2)

#文字雲調被景色與形狀

將整理好的資料直接送入wordcloud2 function即可得到文字雲。此套件為互動式介面，使用滑鼠移動到特定的詞彙，畫面會同時顯示詞彙以及對應的詞頻。

按照日期進行區分

在Ch.1中的日期折線圖中，我們發現在2/22前後討論數量有一波轉折，我們來看看2/22前後的討論內容是否有不同的地方。

data_before <- csv %>% filter(artDate <= "2019-02-22")
data_after <- csv %>% filter(artDate > "2019-02-22")

按照日期切分資料

data_before <- data_before %>% 
  group_by(word) %>% 
  summarise(sum = sum(count)) %>% 
  arrange(desc(sum))

data_after <- data_after %>% 
  group_by(word) %>% 
  summarise(sum = sum(count)) %>% 
  arrange(desc(sum))

與Ch.2相同的資料處理。

plot_before <- data_before %>% wordcloud2()
plot_before

> 2/22 之前的文字雲

plot_after <- data_after %>% wordcloud2()
plot_after

> 2/22之後的文字雲

觀察兩張圖片可以發現，在2/22以前，主要討論為「實況」、「劇情」、「美心」等遊戲本身內容，或是與「返校」等相關作品之比較。在2/22以後，開始出現大量「中國」、「台灣」、「習近平」、「符咒」、「小熊維尼」等關鍵字。對照維基百科簡介可得知此款遊戲於2019/2/21爆發符咒爭議事件，導致討論焦點開始模糊。

練習練習～

可以發現本次資料集是使用「劇情」、「美心」為關鍵字抓的，所以理所當然「劇情」與「美心」的詞頻會很高。
那麼就請各位將這兩個關鍵字移除後重新繪製文字雲吧！

head(data)

## # A tibble: 6 x 2
##   word    sum
##   <chr> <int>
## 1 還願   1367
## 2 遊戲   1280
## 3 中國    587
## 4 台灣    428
## 5 赤燭    417
## 6 網友    306

#data[,1]=="還願"
newdata <- data[-(data$word==c("還願","遊戲")),]

Ch3. 長條圖

文字雲可以直覺看出較常提到的字，但如果想得到精確的「最常出現詞彙」，我們則可以透過長條圖來查看。

data <- rbind(data_before %>% mutate(type="before"), data_after %>% mutate(type="after"))

將2/22前後的資料加入一個欄位標示後合併起來。

除了詞彙及詞頻以外，還多了一個type欄位來區分2/22以前或是以後。

data %>% 
  group_by(type) %>% 
  top_n(15, sum) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, sum))

## # A tibble: 31 x 3
##    word     sum type  
##    <fct>  <int> <chr> 
##  1 還願     412 before
##  2 遊戲     367 before
##  3 台灣     128 before
##  4 實況     106 before
##  5 玩家      98 before
##  6 恐怖      80 before
##  7 返校      78 before
##  8 赤燭      72 before
##  9 劇情      70 before
## 10 有沒有    64 before
## # ... with 21 more rows

plot_merge <- data %>% 
  group_by(type) %>% 
  top_n(15, sum) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(x=word, y=sum, fill = type)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y="詞頻") +
  facet_wrap(~type, ncol = 1, scales="free") + 
  coord_flip()+
  theme(text = element_text(family = "Heiti TC Light"))

plot_merge

> 由上圖可以看到，最常出現的字仍為「還願」、「遊戲」，然而在2/22以後「符咒」、「習近平」、「小熊維尼」等詞彙討論數異常升高。

plot_merge <- data %>% 
  group_by(type) %>% 
  top_n(25, sum) %>%
  ungroup() %>% 
  #filter(!(duplicated(word)| duplicated(word, fromLast = TRUE))) #%>% 
  group_by(word) %>%
  filter(n()==1) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>% 
  ggplot(aes(word, sum, fill = type)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y="總和") +
  facet_wrap(~type, ncol = 1, scales="free") + 
  coord_flip()+
  theme(text = element_text(family = "Heiti TC Light"))

plot_merge

> 將重複的詞彙移除後，可以看到在2/22以前討論的辭彙多是與遊戲劇情有有關的，例如「實況」、「返校」、「劇情」、「直播」、「老師」、「國產」等…
但是在2/22之後，常出現的詞彙變成，「符咒」、「習近平」、「小熊維尼」、「大陸」、「下架」等。

Ch.3 情緒折線圖

透過文字分析平台可以輸出每篇文章的情緒分數，在資料預覽頁面選擇「文章+情緒」選項即可得到資料。

載入資料

csv_sen <- fread("./data/social_media_0305_artSen.csv", encoding = "UTF-8")
head(csv_sen)

##                                        artTitle    artDate  artTime
## 1:       [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019/01/03 14:50:55
## 2:                    [問卦]還願會有櫻花妹玩嗎? 2019/01/13 04:44:56
## 3:                   [問卦]求神不還願會怎麼樣？ 2019/01/13 05:41:36
## 4: [新聞]影／比《返校》更恐怖！赤燭新作《還願》 2019/01/13 06:44:11
## 5: [新聞]國產一級棒！《返校》開發團隊新作《還願 2019/01/14 00:39:48
## 6:                       [問卦]有沒有還願的八卦 2019/01/14 23:34:49
##                                                      artUrl
## 1: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html
## 2: https://www.ptt.cc/bbs/Gossiping/M.1547383858.A.D14.html
## 3: https://www.ptt.cc/bbs/Gossiping/M.1547387269.A.3A7.html
## 4: https://www.ptt.cc/bbs/Gossiping/M.1547391016.A.E55.html
## 5: https://www.ptt.cc/bbs/Gossiping/M.1547455551.A.679.html
## 6: https://www.ptt.cc/bbs/Gossiping/M.1547538053.A.C23.html
##    positive_emotion_grade negative_emotion_grade neutral_emotion_grade
## 1:                      4                      0                     1
## 2:                      1                      0                     0
## 3:                      0                      2                     0
## 4:                      2                      1                     1
## 5:                      4                      3                     1
## 6:                      1                      1                     0

資料欄位介紹：
1. artTitle: 文章之標題，注意:不同文章可能會有完全相同的標題。
2. artDate: 文章發佈之日期。
3. artTime: 文章發佈之時間。
4. artUrl: 文章之網址，每篇文章之網址為獨一無二的，可用來辨識相同標題之不同文章。
5. positive_emotion_grade: 本篇文章出現多少次正面情緒詞彙。
6. negative_emotion_grade: 本篇文章出現多少次負面情緒詞彙。
7. neutral_emotion_grade: 本篇文章出現多少次中性情緒詞彙。

csv_sen$artDate = csv_sen$artDate %>% as.Date("%Y/%m/%d")
str(csv_sen)

## Classes 'data.table' and 'data.frame':   683 obs. of  7 variables:
##  $ artTitle              : chr  "[新聞]萌妹子香汗淋漓　彎腰向四面佛還願" "[問卦]還願會有櫻花妹玩嗎?" "[問卦]求神不還願會怎麼樣？" "[新聞]影／比《返校》更恐怖！赤燭新作《還願》" ...
##  $ artDate               : Date, format: "2019-01-03" "2019-01-13" ...
##  $ artTime               : chr  "14:50:55" "04:44:56" "05:41:36" "06:44:11" ...
##  $ artUrl                : chr  "https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html" "https://www.ptt.cc/bbs/Gossiping/M.1547383858.A.D14.html" "https://www.ptt.cc/bbs/Gossiping/M.1547387269.A.3A7.html" "https://www.ptt.cc/bbs/Gossiping/M.1547391016.A.E55.html" ...
##  $ positive_emotion_grade: int  4 1 0 2 4 1 1 0 16 7 ...
##  $ negative_emotion_grade: int  0 0 2 1 3 1 0 1 9 2 ...
##  $ neutral_emotion_grade : int  1 0 0 1 1 0 0 0 2 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

與Ch.1相同，資料預設日期格式為“chr”，我們需要先將其轉為“date”格式。

csv_sen

##                                           artTitle    artDate  artTime
##   1:        [新聞]萌妹子香汗淋漓　彎腰向四面佛還願 2019-01-03 14:50:55
##   2:                     [問卦]還願會有櫻花妹玩嗎? 2019-01-13 04:44:56
##   3:                    [問卦]求神不還願會怎麼樣？ 2019-01-13 05:41:36
##   4:  [新聞]影／比《返校》更恐怖！赤燭新作《還願》 2019-01-13 06:44:11
##   5:  [新聞]國產一級棒！《返校》開發團隊新作《還願 2019-01-14 00:39:48
##  ---                                                                  
## 679:          [問卦]有生之年等的到《還願》上架嗎？ 2019-03-02 06:10:24
## 680:                [問卦]還願的廣告怎麼只看到二則 2019-03-02 06:49:58
## 681:   [新聞]誤會一場被冤枉了?網友專業神解釋還願咒 2019-03-02 06:51:00
## 682:              [問卦]還願劇情有沒不合理的地方？ 2019-03-02 07:02:18
## 683: Re:[問卦]如果美心還活著現在會過怎麼樣的生活?? 2019-03-02 08:19:05
##                                                        artUrl
##   1: https://www.ptt.cc/bbs/Gossiping/M.1546556219.A.A7A.html
##   2: https://www.ptt.cc/bbs/Gossiping/M.1547383858.A.D14.html
##   3: https://www.ptt.cc/bbs/Gossiping/M.1547387269.A.3A7.html
##   4: https://www.ptt.cc/bbs/Gossiping/M.1547391016.A.E55.html
##   5: https://www.ptt.cc/bbs/Gossiping/M.1547455551.A.679.html
##  ---                                                         
## 679: https://www.ptt.cc/bbs/Gossiping/M.1551536187.A.79D.html
## 680: https://www.ptt.cc/bbs/Gossiping/M.1551538561.A.D84.html
## 681: https://www.ptt.cc/bbs/Gossiping/M.1551538623.A.BC7.html
## 682: https://www.ptt.cc/bbs/Gossiping/M.1551539303.A.E22.html
## 683: https://www.ptt.cc/bbs/Gossiping/M.1551543910.A.952.html
##      positive_emotion_grade negative_emotion_grade neutral_emotion_grade
##   1:                      4                      0                     1
##   2:                      1                      0                     0
##   3:                      0                      2                     0
##   4:                      2                      1                     1
##   5:                      4                      3                     1
##  ---                                                                    
## 679:                      0                      0                     1
## 680:                      2                      0                     1
## 681:                      2                      9                     0
## 682:                      1                      2                     0
## 683:                      1                      1                     0

csv_sen <- csv_sen %>% filter(artDate >= "2019-02-15")

從前幾個章節得到主要討論都是由2/15後才開始，所以先把不相關的雜訊排除。

data_sen <- csv_sen %>% 
  group_by(artDate) %>% 
  summarise(positive = sum(positive_emotion_grade), negative = sum(negative_emotion_grade), neutral = sum(neutral_emotion_grade))

head(data_sen)

## # A tibble: 6 x 4
##   artDate    positive negative neutral
##   <date>        <int>    <int>   <int>
## 1 2019-02-17        0        1       0
## 2 2019-02-18        3        7       1
## 3 2019-02-19      132      102      17
## 4 2019-02-20       78       67      13
## 5 2019-02-21       34       28       5
## 6 2019-02-22      102      122      13

將資料按照日期分群後，加總每天的三種情緒分數，並進行標準化。

data_sen %>% ggplot(aes(x= artDate)) +
  geom_line(aes(y = positive, colour = "red")) +
  geom_line(aes(y = negative, colour = "blue")) +
  geom_line(aes(y = neutral, colour = "yellow")) + 
  scale_x_date(labels = date_format("%Y/%m/%d")) +#時間格式轉換
  scale_color_discrete(name="情緒種類", labels = c("positive","negative","neutral")) + #圖標
  ggtitle("還願 討論情緒") + 
  xlab("日期") + 
  ylab("分數") + 
  theme(text = element_text(family = "Heiti TC Light"))

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x,
## x$y, : font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

練習練習～

剛剛我們將資料按照日期分群後，加總計算了每天的三種情緒分數。但是我們可以發現每天的文章數量造成辭彙的數量不一樣，所以情緒圖的走勢會有很大的起伏。請各位將每日的情緒總值進行標準化後再畫一次情緒的走勢圖。

公式：
positive = positive / (positive + negative + neutral)
negative = negative / (positive + negative + neutral)
neutral = neutral / (positive + negative + neutral)

hint：剛剛summarise那邊可以順便計算每日的情緒詞總數
hint2：最後會用到mutate喔

data_try <- data_sen %>% 
  mutate(positive=positive / (positive + negative + neutral),negative = negative / (positive + negative + neutral),neutral = neutral / (positive + negative + neutral))
head(data_try)

## # A tibble: 6 x 4
##   artDate    positive negative neutral
##   <date>        <dbl>    <dbl>   <dbl>
## 1 2019-02-17    0.       1.00    0.   
## 2 2019-02-18    0.273    0.846   0.472
## 3 2019-02-19    0.526    0.853   0.925
## 4 2019-02-20    0.494    0.832   0.907
## 5 2019-02-21    0.507    0.836   0.788
## 6 2019-02-22    0.430    0.901   0.907

data_try %>% ggplot(aes(x= artDate)) +
  geom_line(aes(y = positive, colour = "red")) +
  geom_line(aes(y = negative, colour = "blue")) +
  geom_line(aes(y = neutral, colour = "yellow")) + 
  scale_x_date(labels = date_format("%Y/%m/%d")) +#時間格式轉換
  scale_color_discrete(name="情緒種類標準化後", labels = c("positive","negative","neutral")) + #圖標
  ggtitle("還願 討論情緒") + 
  xlab("日期") + 
  ylab("分數") + 
  theme(text = element_text(family = "Heiti TC Light"))

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x,
## x$y, : font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

以文字分析平台之資料進行基本繪圖

MingLun Wu

2019/03/05

Ch.0 : 資料取得及觀念複習

1. 資料取得及套件載入

資料簡介

安裝需要的packages

資料欄位

Ch.1 日期折線圖

資料處理

Ch.2 文字雲

按照日期進行區分

練習練習～

Ch3. 長條圖

Ch.3 情緒折線圖

載入資料

練習練習～