Data on YouTuber discussions, scraped from the PTT Gossiping board with a text-analysis platform, is visualized using R packages.

Motivation and Goals

YouTube is one of today's most widely used video platforms, and it has fueled the rise of the influencer economy, drawing ever more people to run their own channels as YouTubers. YouTubers have in turn become a frequent topic on many forums. In this midterm project, we use the 中山 (NSYSU) text-analysis platform to scrape text data from the PTT forum, aiming to identify the main factors that drive a YouTuber's share of discussion and to surface the topics associated with each individual YouTuber.

Data Overview

  • Data source: PTT Gossiping board
  • Ranking reference: DailyView 網路溫度計 YouTuber排行榜|2021
  • Crawl period: 2020/01/01 ~ 2021/04/27
  • YouTuber profiles:
  1. 館長: Currently about 1.05M subscribers. He runs gyms in Taiwan and posts fitness videos, gaming livestreams, and commentary on current events. After a dispute with gangsters, he was shot three times by a gunman outside his Linkou gym; he happened to be livestreaming at the time, and was back to working out not long after.
  2. 蒼藍鴿: A physician and YouTuber with about 523K subscribers whose channel mainly shares medical knowledge. He recently had a public dispute with 愛莉莎莎 over the "liver and gallbladder flush".
  3. 愛莉莎莎: About 1.17M subscribers. She began her channel while studying in Korea and became known for sharing local beauty, travel, and trend content. She recently drew a wave of attention after trying the "liver and gallbladder flush" described in a book.
  4. 蔡阿嘎: About 2.47M subscribers; he makes videos on current events as well as comedy. He recently made headlines for celebrating his son 蔡桃貴's second birthday with a 蔡桃貴-themed takeover of the Sanchong MRT station.
  5. 黃氏兄弟: About 1.66M subscribers; the channel focuses on lifestyle videos. Controversy arose when member 瑋瑋 was revealed to have arranged a gay one-night stand, which was widely discussed on PTT.

Part 1: Loading Packages and Data Preprocessing

require(readr)
require(dplyr)
require(stringr)
require(jiebaR)
require(tidytext)
require(NLP)
require(tidyr)
require(ggplot2)
require(wordcloud2)
require(ggraph)
require(igraph)
require(scales)
require(reshape2)
require(widyr)
require(htmlwidgets)
require(webshot)
Read in the data
# Read the CSV; collapse runs of newlines to 。, drop remaining newlines, and strip URLs
ytr <- read_csv("youtuber.csv") %>% 
  mutate(sentence=gsub("[\n]{2,}", "。", sentence)) %>% 
  mutate(sentence=gsub("\n", "", sentence)) %>% 
  mutate(sentence=gsub("http(s)?[-:\\/A-Za-z0-9\\.]+", " ", sentence))
## 
## -- Column specification --------------------------------------------------------
## cols(
##   artTitle = col_character(),
##   artDate = col_date(format = ""),
##   artTime = col_time(format = ""),
##   artUrl = col_character(),
##   artPoster = col_character(),
##   artCat = col_character(),
##   commentNum = col_double(),
##   push = col_double(),
##   boo = col_double(),
##   sentence = col_character()
## )
Remove news-source boilerplate terms
# Remove news-article boilerplate (e.g. 媒體來源, 完整新聞連結) from the text
ytr <- ytr %>% 
  mutate(sentence=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除", "", sentence))
head(ytr)
## # A tibble: 6 x 10
##   artTitle   artDate    artTime  artUrl  artPoster artCat commentNum  push   boo
##   <chr>      <date>     <time>   <chr>   <chr>     <chr>       <dbl> <dbl> <dbl>
## 1 [新聞]3萬6手錶~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
## 2 [爆卦]館長:王定~ 2019-12-31 18:27:38 https:~ v840122v  Gossi~        260    56    82
## 3 Re:[新聞]3萬~ 2019-12-31 19:44:33 https:~ soulboy3~ Gossi~         19     9     3
## 4 Re:[爆卦]館長~ 2019-12-31 21:01:04 https:~ fckj1131~ Gossi~          6     3     0
## 5 [問卦]愛莉莎莎手~ 2020-01-01 00:03:32 https:~ f17690815 Gossi~         24     7    10
## 6 [問卦]館長跨年在~ 2020-01-01 01:18:44 https:~ kimo6414  Gossi~          8     5     0
## # ... with 1 more variable: sentence <chr>
Initialize the jieba tokenizer
# Initialize the jieba segmentation engine
jieba_tokenizer = worker()
# Define a tokenizer function for unnest_tokens
chi_tokenizer <- function(t) {
  lapply(t, function(x) {
    if(nchar(x)>1){
      tokens <- segment(x, jieba_tokenizer)
      # Drop single-character tokens
      tokens <- tokens[nchar(tokens)>1]
      return(tokens)
    }
  })
}
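As a quick sanity check (an illustrative call, not part of the original pipeline), the tokenizer can be tried on a single sentence; single-character tokens are dropped as intended:
chi_tokenizer("館長在林口開了健身房")
# returns a list holding one vector of multi-character tokens;
# the exact segmentation depends on jieba's built-in dictionary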
Read in the names of the top five YouTubers
ytr_lexicon <- scan(file = "youtuber.txt", what=character(),sep='\n', 
                    encoding='utf-8',fileEncoding='utf-8',quiet = T)
ytr_lexicon
## [1] "館長"     "蒼藍鴿"   "愛莉莎莎" "蔡阿嘎"   "黃氏兄弟"
Add the five YouTuber names to jieba's user dictionary
new_user_word(jieba_tokenizer, c(ytr_lexicon))
## [1] TRUE
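To verify the custom words took effect (an illustrative check; exact output depends on jieba's dictionary), a name such as 愛莉莎莎 should now come back as a single token rather than being split apart:
segment("愛莉莎莎的新影片", jieba_tokenizer)
# "愛莉莎莎" should appear as one token after new_user_word()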
Tokenize, drop numeric tokens, and count word occurrences
token <- ytr %>%
  unnest_tokens(word, sentence, token=chi_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9]")) ) %>%
  count(artDate, artUrl, word, sort = TRUE)
head(token)
## # A tibble: 6 x 4
##   artDate    artUrl                                                  word      n
##   <date>     <chr>                                                   <chr> <int>
## 1 2021-01-20 https://www.ptt.cc/bbs/Gossiping/M.1611172657.A.EB7.ht~ the     112
## 2 2020-08-31 https://www.ptt.cc/bbs/Gossiping/M.1598847524.A.9F8.ht~ 告訴     91
## 3 2020-08-31 https://www.ptt.cc/bbs/Gossiping/M.1598847524.A.9F8.ht~ 被告     74
## 4 2020-09-26 https://www.ptt.cc/bbs/Gossiping/M.1601093160.A.D4E.ht~ 小黃瓜~    68
## 5 2020-09-03 https://www.ptt.cc/bbs/Gossiping/M.1599109478.A.C71.ht~ 被告     54
## 6 2020-08-31 https://www.ptt.cc/bbs/Gossiping/M.1598847524.A.9F8.ht~ 紀錄     50

On inspection, the tokens include unimportant English words; these are cleaned up in a later step.
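As a quick look before that cleanup (an illustrative sketch; wt = n sums the per-article counts), the dominant English tokens can be listed with:
token %>%
  filter(str_detect(word, regex("[A-Za-z]"))) %>%
  count(word, wt = n, sort = TRUE) %>%
  head()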

Splitting tokens that mention a YouTuber from those that do not

In this step, besides extracting the rows that mention any of the YouTubers, we also extract rows for each individual YouTuber to support the word-frequency analysis that follows.

#Tokens that are a YouTuber name
isname=token %>%
  filter(word %in% ytr_lexicon)

#Tokens that are not a YouTuber name; also drop English tokens
other=token %>%
  filter(!word %in% ytr_lexicon) %>% 
  filter(!str_detect(word, regex("[a-zA-Z]")) )

alisasa=token %>%
  filter(word %in% c("愛莉莎莎","alisasa"))

aga=token %>%
  filter(word %in% c("蔡阿嘎"))
  
guan=token %>%
  filter(word %in% c("成吉思汗","陳之漢","飆捍","館長"))
  
huang=token %>%
  filter(word %in% c("黃氏兄弟","瑋瑋","哲哲"))
  
pigeon=token %>%
  filter(word %in% c("蒼藍鴿"))

list_all=left_join(isname, other, by = "artUrl")
list_alisasa=left_join(alisasa, other, by = "artUrl")
list_aga=left_join(aga, other, by = "artUrl")
list_guan=left_join(guan, other, by = "artUrl")
list_huang=left_join(huang, other, by = "artUrl")
list_pigeon=left_join(pigeon, other, by = "artUrl")
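After these joins, dplyr suffixes the overlapping columns, so word.x/n.x come from the YouTuber-name side and word.y/n.y from the co-occurring-token side. A quick check (the names follow from dplyr's join semantics):
names(list_all)
# expected: "artDate.x" "artUrl" "word.x" "n.x" "artDate.y" "word.y" "n.y"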
Merge the two data frames and map each YouTuber's aliases to one canonical name for the correlation analysis later
# Merge the two data frames; replace each alias with the YouTuber's canonical name
ytr_final <- rbind(isname,other) %>%   
  mutate(word=gsub("alisasa", "愛莉莎莎", word)) %>% 
  mutate(word=gsub("成吉思汗|陳之漢|飆捍", "館長", word))

This completes the data preprocessing. The parts below analyze word frequencies, sentiment, and word correlations, with visualizations and interpretation of the results.

Part 2: Word-Frequency Analysis


result_all <- list_all %>% 
  group_by(artDate.y,word.x) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_alisasa <- list_alisasa %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_aga <- list_aga %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_guan <- list_guan %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_huang <- list_huang %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_pigeon <- list_pigeon %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))
Generate word clouds for each YouTuber
wordcloud1 <- result_alisasa %>% filter(sum>200) %>% wordcloud2()
saveWidget(wordcloud1, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud1.png", delay = 5, vwidth = 500, vheight = 500)

wordcloud2 <- result_aga %>% filter(sum>100) %>% wordcloud2()
saveWidget(wordcloud2, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud2.png", delay = 5, vwidth = 500, vheight = 500)

wordcloud3 <- result_guan %>% filter(sum>150) %>% wordcloud2()
saveWidget(wordcloud3, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud3.png", delay = 5, vwidth = 500, vheight = 500)

wordcloud4 <- result_huang %>% filter(sum>20) %>% wordcloud2()
saveWidget(wordcloud4, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud4.png", delay = 5, vwidth = 500, vheight = 500)

wordcloud5 <- result_pigeon %>% filter(sum>50) %>% wordcloud2()
saveWidget(wordcloud5, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud5.png", delay = 5, vwidth = 500, vheight = 500)
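The five blocks above repeat one pattern with different thresholds; a small helper (a sketch assuming the same objects and packages as above) would remove the duplication:
make_wordcloud <- function(result, min_sum, outfile) {
  # keep words above the count threshold, render the cloud, snapshot it to PNG
  wc <- result %>% filter(sum > min_sum) %>% wordcloud2()
  saveWidget(wc, "temp.html", selfcontained = FALSE)
  webshot("temp.html", outfile, delay = 5, vwidth = 500, vheight = 500)
}
# e.g. make_wordcloud(result_alisasa, 200, "wordcloud1.png")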

YouTuber Discussion-Volume Analysis
flow <- result_all %>% 
  group_by(word.x) %>% 
  arrange(desc(sum))
flow_all_season <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  labs(x = "時間", y = "流量", colour = "YouTuber")
  
flow_season_1 <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  scale_x_date(limits = as.Date(c('2020-01-01','2020-03-31')))+
  labs(x = "時間", y = "流量", colour = "YouTuber")

flow_season_2 <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))+
  labs(x = "時間", y = "流量", colour = "YouTuber")

flow_season_3 <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  scale_x_date(limits = as.Date(c('2020-07-01','2020-09-30')))+
  labs(x = "時間", y = "流量", colour = "YouTuber")

flow_season_4 <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))+
  labs(x = "時間", y = "流量", colour = "YouTuber")
flow_all_season

flow_season_1
## Warning: Removed 965 row(s) containing missing values (geom_path).

flow_season_2
## Warning: Removed 956 row(s) containing missing values (geom_path).

flow_season_3
## Warning: Removed 890 row(s) containing missing values (geom_path).

flow_season_4
## Warning: Removed 932 row(s) containing missing values (geom_path).
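The four quarterly charts above differ only in their date limits; a hypothetical helper in the same style could generate any quarter on demand:
plot_flow_quarter <- function(df, from, to) {
  # daily counts of co-occurring tokens per YouTuber, restricted to one quarter
  df %>%
    ggplot() +
    geom_line(aes(x = artDate.y, y = sum, colour = word.x)) +
    scale_x_date(limits = as.Date(c(from, to))) +
    labs(x = "時間", y = "流量", colour = "YouTuber")
}
# e.g. plot_flow_quarter(result_all, "2020-01-01", "2020-03-31")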

Part 3: Sentiment Analysis

Because this part takes each article's pushes (推) and boos (噓) into account, we lightly reshape the data and tokenize it again.

Set up the sentiment dictionaries
P <- read_file("positive.txt") # positive-word dictionary (txt file)
N <- read_file("negative.txt") # negative-word dictionary (txt file)

# each dictionary file is read in as one long string
typeof(P)
## [1] "character"
typeof(N)
## [1] "character"
Split the dictionary strings into word vectors
# Split each string on commas
# strsplit returns a list; take its first element
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]

# Build data frames with columns word and sentiment
P = data.frame(word = P, sentiment = "positive") # 664 words
N = data.frame(word = N, sentiment = "negative") # 1047 words

# Combine the two dictionaries into one
LIWC = rbind(P, N)

# Inspect the result
head(LIWC)
##       word sentiment
## 1     一流  positive
## 2 下定決心  positive
## 3 不拘小節  positive
## 4   不費力  positive
## 5     不錯  positive
## 6     主動  positive
Aggregate article pushes and boos

Pushes (推) and boos (噓) serve as proxies for readers' positive and negative sentiment toward an article.

ytr <- ytr%>%
  mutate(emo=(push-boo))
Categorize each article by the YouTuber named in its title (an article whose title names several YouTubers appears once per name)
cate <- ytr%>%
  unnest_tokens(cate,artTitle,token = chi_tokenizer)%>%
  select(artUrl,cate)%>% 
  filter(cate %in% ytr_lexicon)

ytr <- ytr%>%
  inner_join(cate)
## Joining, by = "artUrl"
ytr_after_jieba <- ytr %>%
  unnest_tokens(word, sentence, token=chi_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9]")) )


ytr_after_jieba
## # A tibble: 456,114 x 12
##    artTitle  artDate    artTime  artUrl  artPoster artCat commentNum  push   boo
##    <chr>     <date>     <time>   <chr>   <chr>     <chr>       <dbl> <dbl> <dbl>
##  1 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  2 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  3 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  4 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  5 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  6 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  7 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  8 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  9 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
## 10 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
## # ... with 456,104 more rows, and 3 more variables: emo <dbl>, cate <chr>,
## #   word <chr>
Join the sentiment dictionary and compute daily sentiment
sentiment <- ytr_after_jieba %>%
  select(artDate,word,cate) %>%
  inner_join(LIWC) %>%
  mutate(senti_count = ifelse(sentiment =="positive",1,-1))%>%
  select(artDate,senti_count,cate) %>%
  group_by(artDate,cate)%>%
  summarise(sum = sum(senti_count))
## Joining, by = "word"
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
Visualizing the sentiment results
all_season <- sentiment%>%
  filter(cate %in% c("愛莉莎莎","館長","蒼藍鴿","蔡阿嘎","黃氏兄弟"))%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))


all_season

The chart above shows notably sharp swings for 黃氏兄弟 in 2020 Q2, 蒼藍鴿 in 2021 Q1, 愛莉莎莎 in 2020 Q2 and 2021 Q1, and 蔡阿嘎 and 館長 in 2020 Q4, so below we re-plot the sentiment results for each of those quarters:

### Individual YouTubers
#### 黃氏兄弟:
one_season <- sentiment%>%
  filter(cate %in% "黃氏兄弟")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))

#### 蒼藍鴿:
two_season <- sentiment%>%
  filter(cate %in% "蒼藍鴿")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2021-01-01','2021-03-31')))

#### 愛莉莎莎
three_season_1 <- sentiment%>%
  filter(cate %in% "愛莉莎莎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2021-01-01','2021-03-31')))

three_season_2 <- sentiment%>%
  filter(cate %in% "愛莉莎莎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))

#### 蔡阿嘎
four_season <- sentiment%>%
  filter(cate %in% "蔡阿嘎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))

#### 館長
five_season <- sentiment%>%
  filter(cate %in% "館長")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))
one_season
## Warning: Removed 5 row(s) containing missing values (geom_path).

two_season
## Warning: Removed 1 row(s) containing missing values (geom_path).

three_season_1
## Warning: Removed 335 row(s) containing missing values (geom_path).

three_season_2
## Warning: Removed 351 row(s) containing missing values (geom_path).

four_season
## Warning: Removed 136 row(s) containing missing values (geom_path).

five_season
## Warning: Removed 343 row(s) containing missing values (geom_path).

Compute sentiment scores from pushes and boos
ytr_emo <- ytr%>%
  select(artDate,cate,emo)%>%
  group_by(artDate,cate)%>%
  summarise(sum = sum(emo))
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
Visualizing the sentiment scores
### Overall
all_year_emo <- ytr_emo%>%
  filter(cate %in% c("愛莉莎莎","館長","蒼藍鴿","蔡阿嘎","黃氏兄弟"))%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))
### Individual YouTubers
#### 黃氏兄弟:
one_emo_season <- ytr_emo%>%
  filter(cate %in% "黃氏兄弟")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))

#### 蒼藍鴿:
two_emo_season <- ytr_emo%>%
  filter(cate %in% "蒼藍鴿")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2021-01-01','2021-03-31')))

#### 愛莉莎莎
three_emo_season_1 <- ytr_emo%>%
  filter(cate %in% "愛莉莎莎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2021-01-01','2021-03-31')))

three_emo_season_2 <- ytr_emo%>%
  filter(cate %in% "愛莉莎莎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))

#### 蔡阿嘎
four_emo_season <- ytr_emo%>%
  filter(cate %in% "蔡阿嘎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))

#### 館長
five_emo_season <- ytr_emo%>%
  filter(cate %in% "館長")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))
one_emo_season
## Warning: Removed 6 row(s) containing missing values (geom_path).

two_emo_season
## Warning: Removed 1 row(s) containing missing values (geom_path).

three_emo_season_1
## Warning: Removed 346 row(s) containing missing values (geom_path).

three_emo_season_2
## Warning: Removed 360 row(s) containing missing values (geom_path).

four_emo_season
## Warning: Removed 148 row(s) containing missing values (geom_path).

five_emo_season
## Warning: Removed 353 row(s) containing missing values (geom_path).
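Both chart series in this part repeat one template per YouTuber and quarter; a hypothetical helper that works for sentiment and ytr_emo alike would be:
plot_cate_quarter <- function(df, who, from, to) {
  # daily sentiment line for one YouTuber within one quarter
  df %>%
    filter(cate == who) %>%
    ggplot() +
    geom_line(aes(x = artDate, y = sum, colour = cate)) +
    scale_x_date(limits = as.Date(c(from, to)))
}
# e.g. plot_cate_quarter(sentiment, "蒼藍鴿", "2021-01-01", "2021-03-31")
# e.g. plot_cate_quarter(ytr_emo, "館長", "2020-10-01", "2020-12-31")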

Part 4: Correlated-Word Analysis

Compute pairwise correlation values
ytr_cors  <- ytr_final %>%
  group_by(word) %>%
  filter(n() >= 10) %>%
  pairwise_cor(word, artUrl, sort = TRUE)
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
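pairwise_cor() computes, for each pair of words, the phi coefficient of their co-presence across articles (artUrl). A toy example on made-up rows (purely illustrative) shows the behavior:
toy <- tibble::tibble(
  doc  = c("a", "a", "b", "b", "c"),
  word = c("館長", "健身", "館長", "健身", "愛莉莎莎")
)
toy %>% pairwise_cor(word, doc)
# 館長 and 健身 appear in exactly the same documents, so their correlation is 1;
# each is perfectly anti-correlated (-1) with 愛莉莎莎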
YouTuber list after alias consolidation
gsub_ytr_lexicon <- scan(file = "youtuber.txt", what=character(),sep='\n', 
                    encoding='utf-8',fileEncoding='utf-8',quiet = T)
Plotting the correlation results
ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[1]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "相關詞", y = "Correlation")

ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[2]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "Correlation", y = "相關詞")

ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[3]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "Correlation", y = "相關詞")

ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[4]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "Correlation", y = "相關詞")

ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[5]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "Correlation", y = "相關詞")

Part 5: Co-occurrence Network

Continuing from the correlated-word analysis in the previous part, we draw a co-occurrence network for the five YouTubers:

ytr_cors_coocc = ytr_cors %>% filter(item1 %in% gsub_ytr_lexicon[1:5])
ytr_cors_coocc %>%
  filter(correlation>0.18) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") + # macOS font; use a locally installed CJK font to avoid the warning below
  theme_void()
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

Conclusions

  • Notably, the co-occurrence network sheds light on the 愛莉莎莎–蒼藍鴿 dispute. The 愛莉莎莎 side of the graph rarely links to 蒼藍鴿, while the 蒼藍鴿 side links to 愛莉莎莎 repeatedly. A reasonable explanation is that 愛莉莎莎 is discussed across a wide range of topics, so this single event gets diluted, whereas 蒼藍鴿 normally attracts little discussion, making the event stand out. In other words, the pattern reflects each YouTuber's baseline discussion frequency and breadth of topics, and it also suggests the two are not at the same level of popularity.
  • The sentiment line charts show large swings in discussion sentiment when an event breaks, which is entirely reasonable. Looking at what follows, however, the sentiment curve returns to normal within a short time after each event. This differs from what we originally expected: we assumed that after a major incident a YouTuber's image would stay tarnished or markedly improved, yet at least in day-to-day discussion and comments no such lasting shift appears.