Data on YouTuber discussions, scraped from the PTT Gossiping board with a text-analysis platform, is visualized using R packages.

Motivation and Goals

YouTube is one of today's most widely used video platforms, and it has fueled the rise of the influencer economy, drawing ever more people to run their own channels as YouTubers. YouTubers have in turn become a frequent topic on many forums. In this midterm project, we use the 中山 (NSYSU) text-analysis platform to scrape text data from the PTT forum, aiming to identify the main factors that drive a YouTuber's share of discussion and to surface the topics associated with each individual YouTuber.

Data Overview

  • Data source: PTT Gossiping board
  • Ranking reference: DailyView 網路溫度計 YouTuber排行榜|2021
  • Crawl period: 2020/01/01 ~ 2021/04/27
  • YouTuber profiles:
  1. 館長: Currently about 1.05M subscribers. He runs gyms in Taiwan and posts fitness videos, gaming livestreams, and commentary on current events. After a dispute with gangsters, he was shot three times by a gunman outside his Linkou gym; he happened to be livestreaming at the time, and was back to working out not long after.
  2. 蒼藍鴿: A physician and YouTuber with about 523K subscribers whose channel mainly shares medical knowledge. He recently had a public dispute with 愛莉莎莎 over the "liver and gallbladder flush".
  3. 愛莉莎莎: About 1.17M subscribers. She began her channel while studying in Korea and became known for sharing local beauty, travel, and trend content. She recently drew a wave of attention after trying the "liver and gallbladder flush" described in a book.
  4. 蔡阿嘎: About 2.47M subscribers; he makes videos on current events as well as comedy. He recently made headlines for celebrating his son 蔡桃貴's second birthday with a 蔡桃貴-themed takeover of the Sanchong MRT station.
  5. 黃氏兄弟: About 1.66M subscribers; the channel focuses on lifestyle videos. Controversy arose when member 瑋瑋 was revealed to have arranged a gay one-night stand, which was widely discussed on PTT.

Part 1: Loading Packages and Data Preprocessing

require(readr)
require(dplyr)
require(stringr)
require(jiebaR)
require(tidytext)
require(NLP)
require(tidyr)
require(ggplot2)
require(wordcloud2)
require(ggraph)
require(igraph)
require(scales)
require(reshape2)
require(widyr)
require(htmlwidgets)
require(webshot)
Read in the data
# Read the CSV; collapse runs of newlines to 。, drop remaining newlines, and strip URLs
ytr <- read_csv("youtuber.csv") %>% 
  mutate(sentence=gsub("[\n]{2,}", "。", sentence)) %>% 
  mutate(sentence=gsub("\n", "", sentence)) %>% 
  mutate(sentence=gsub("http(s)?[-:\\/A-Za-z0-9\\.]+", " ", sentence))
## 
## -- Column specification --------------------------------------------------------
## cols(
##   artTitle = col_character(),
##   artDate = col_date(format = ""),
##   artTime = col_time(format = ""),
##   artUrl = col_character(),
##   artPoster = col_character(),
##   artCat = col_character(),
##   commentNum = col_double(),
##   push = col_double(),
##   boo = col_double(),
##   sentence = col_character()
## )
Remove news-source boilerplate terms
# Remove news-article boilerplate (e.g. 媒體來源, 完整新聞連結) from the text
ytr <- ytr %>% 
  mutate(sentence=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除", "", sentence))
head(ytr)
## # A tibble: 6 x 10
##   artTitle   artDate    artTime  artUrl  artPoster artCat commentNum  push   boo
##   <chr>      <date>     <time>   <chr>   <chr>     <chr>       <dbl> <dbl> <dbl>
## 1 [新聞]3萬6手錶~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
## 2 [爆卦]館長:王定~ 2019-12-31 18:27:38 https:~ v840122v  Gossi~        260    56    82
## 3 Re:[新聞]3萬~ 2019-12-31 19:44:33 https:~ soulboy3~ Gossi~         19     9     3
## 4 Re:[爆卦]館長~ 2019-12-31 21:01:04 https:~ fckj1131~ Gossi~          6     3     0
## 5 [問卦]愛莉莎莎手~ 2020-01-01 00:03:32 https:~ f17690815 Gossi~         24     7    10
## 6 [問卦]館長跨年在~ 2020-01-01 01:18:44 https:~ kimo6414  Gossi~          8     5     0
## # ... with 1 more variable: sentence <chr>
Initialize the jieba tokenizer
# Initialize the jieba segmentation engine
jieba_tokenizer = worker()
# Define a tokenizer function for unnest_tokens
chi_tokenizer <- function(t) {
  lapply(t, function(x) {
    if(nchar(x)>1){
      tokens <- segment(x, jieba_tokenizer)
      # Drop single-character tokens
      tokens <- tokens[nchar(tokens)>1]
      return(tokens)
    }
  })
}
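As a quick sanity check (an illustrative call, not part of the original pipeline), the tokenizer can be tried on a single sentence; single-character tokens are dropped as intended:
chi_tokenizer("館長在林口開了健身房")
# returns a list holding one vector of multi-character tokens;
# the exact segmentation depends on jieba's built-in dictionary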
Read in the names of the top five YouTubers
ytr_lexicon <- scan(file = "youtuber.txt", what=character(),sep='\n', 
                    encoding='utf-8',fileEncoding='utf-8',quiet = T)
ytr_lexicon
## [1] "館長"     "蒼藍鴿"   "愛莉莎莎" "蔡阿嘎"   "黃氏兄弟"
Add the five YouTuber names to jieba's user dictionary
new_user_word(jieba_tokenizer, c(ytr_lexicon))
## [1] TRUE
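To verify the custom words took effect (an illustrative check; exact output depends on jieba's dictionary), a name such as 愛莉莎莎 should now come back as a single token rather than being split apart:
segment("愛莉莎莎的新影片", jieba_tokenizer)
# "愛莉莎莎" should appear as one token after new_user_word()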
Tokenize, drop numeric tokens, and count word occurrences
token <- ytr %>%
  unnest_tokens(word, sentence, token=chi_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9]")) ) %>%
  count(artDate, artUrl, word, sort = TRUE)
head(token)
## # A tibble: 6 x 4
##   artDate    artUrl                                                  word      n
##   <date>     <chr>                                                   <chr> <int>
## 1 2021-01-20 https://www.ptt.cc/bbs/Gossiping/M.1611172657.A.EB7.ht~ the     112
## 2 2020-08-31 https://www.ptt.cc/bbs/Gossiping/M.1598847524.A.9F8.ht~ 告訴     91
## 3 2020-08-31 https://www.ptt.cc/bbs/Gossiping/M.1598847524.A.9F8.ht~ 被告     74
## 4 2020-09-26 https://www.ptt.cc/bbs/Gossiping/M.1601093160.A.D4E.ht~ 小黃瓜~    68
## 5 2020-09-03 https://www.ptt.cc/bbs/Gossiping/M.1599109478.A.C71.ht~ 被告     54
## 6 2020-08-31 https://www.ptt.cc/bbs/Gossiping/M.1598847524.A.9F8.ht~ 紀錄     50

On inspection, the tokens include unimportant English words; these are cleaned up in a later step.
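As a quick look before that cleanup (an illustrative sketch; wt = n sums the per-article counts), the dominant English tokens can be listed with:
token %>%
  filter(str_detect(word, regex("[A-Za-z]"))) %>%
  count(word, wt = n, sort = TRUE) %>%
  head()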

Splitting tokens that mention a YouTuber from those that do not

In this step, besides extracting the rows that mention any of the YouTubers, we also extract rows for each individual YouTuber to support the word-frequency analysis that follows.

#Tokens that are a YouTuber name
isname=token %>%
  filter(word %in% ytr_lexicon)

#Tokens that are not a YouTuber name; also drop English tokens
other=token %>%
  filter(!word %in% ytr_lexicon) %>% 
  filter(!str_detect(word, regex("[a-zA-Z]")) )

alisasa=token %>%
  filter(word %in% c("愛莉莎莎","alisasa"))

aga=token %>%
  filter(word %in% c("蔡阿嘎"))
  
guan=token %>%
  filter(word %in% c("成吉思汗","陳之漢","飆捍","館長"))
  
huang=token %>%
  filter(word %in% c("黃氏兄弟","瑋瑋","哲哲"))
  
pigeon=token %>%
  filter(word %in% c("蒼藍鴿"))

list_all=left_join(isname, other, by = "artUrl")
list_alisasa=left_join(alisasa, other, by = "artUrl")
list_aga=left_join(aga, other, by = "artUrl")
list_guan=left_join(guan, other, by = "artUrl")
list_huang=left_join(huang, other, by = "artUrl")
list_pigeon=left_join(pigeon, other, by = "artUrl")
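After these joins, dplyr suffixes the overlapping columns, so word.x/n.x come from the YouTuber-name side and word.y/n.y from the co-occurring-token side. A quick check (the names follow from dplyr's join semantics):
names(list_all)
# expected: "artDate.x" "artUrl" "word.x" "n.x" "artDate.y" "word.y" "n.y"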
Merge the two data frames and map each YouTuber's aliases to one canonical name for the correlation analysis later
# Merge the two data frames; replace each alias with the YouTuber's canonical name
ytr_final <- rbind(isname,other) %>%   
  mutate(word=gsub("alisasa", "愛莉莎莎", word)) %>% 
  mutate(word=gsub("成吉思汗|陳之漢|飆捍", "館長", word))

This completes the data preprocessing. The parts below analyze word frequencies, sentiment, and word correlations, with visualizations and interpretation of the results.

Part 2: Word-Frequency Analysis


result_all <- list_all %>% 
  group_by(artDate.y,word.x) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_alisasa <- list_alisasa %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_aga <- list_aga %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_guan <- list_guan %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_huang <- list_huang %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))

result_pigeon <- list_pigeon %>% 
  group_by(word.y) %>% 
  summarise(sum = sum(n.y), .groups = 'drop') %>% 
  arrange(desc(sum))
Generate word clouds for each YouTuber
wordcloud1 <- result_alisasa %>% filter(sum>200) %>% wordcloud2()
saveWidget(wordcloud1, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud1.png", delay = 5, vwidth = 500, vheight = 500)

wordcloud2 <- result_aga %>% filter(sum>100) %>% wordcloud2()
saveWidget(wordcloud2, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud2.png", delay = 5, vwidth = 500, vheight = 500)

wordcloud3 <- result_guan %>% filter(sum>150) %>% wordcloud2()
saveWidget(wordcloud3, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud3.png", delay = 5, vwidth = 500, vheight = 500)

wordcloud4 <- result_huang %>% filter(sum>20) %>% wordcloud2()
saveWidget(wordcloud4, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud4.png", delay = 5, vwidth = 500, vheight = 500)

wordcloud5 <- result_pigeon %>% filter(sum>50) %>% wordcloud2()
saveWidget(wordcloud5, "temp.html", selfcontained = F)
webshot("temp.html", "wordcloud5.png", delay = 5, vwidth = 500, vheight = 500)
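The five blocks above repeat one pattern with different thresholds; a small helper (a sketch assuming the same objects and packages as above) would remove the duplication:
make_wordcloud <- function(result, min_sum, outfile) {
  # keep words above the count threshold, render the cloud, snapshot it to PNG
  wc <- result %>% filter(sum > min_sum) %>% wordcloud2()
  saveWidget(wc, "temp.html", selfcontained = FALSE)
  webshot("temp.html", outfile, delay = 5, vwidth = 500, vheight = 500)
}
# e.g. make_wordcloud(result_alisasa, 200, "wordcloud1.png")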

YouTuber Discussion-Volume Analysis
flow <- result_all %>% 
  group_by(word.x) %>% 
  arrange(desc(sum))
flow_all_season <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  labs(x = "時間", y = "流量", colour = "YouTuber")
  
flow_season_1 <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  scale_x_date(limits = as.Date(c('2020-01-01','2020-03-31')))+
  labs(x = "時間", y = "流量", colour = "YouTuber")

flow_season_2 <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))+
  labs(x = "時間", y = "流量", colour = "YouTuber")

flow_season_3 <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  scale_x_date(limits = as.Date(c('2020-07-01','2020-09-30')))+
  labs(x = "時間", y = "流量", colour = "YouTuber")

flow_season_4 <- result_all %>%
  ggplot()+
  geom_line(aes(x = artDate.y,y = sum,colour = word.x))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))+
  labs(x = "時間", y = "流量", colour = "YouTuber")
flow_all_season

flow_season_1
## Warning: Removed 965 row(s) containing missing values (geom_path).

flow_season_2
## Warning: Removed 956 row(s) containing missing values (geom_path).

flow_season_3
## Warning: Removed 890 row(s) containing missing values (geom_path).

flow_season_4
## Warning: Removed 932 row(s) containing missing values (geom_path).
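The four quarterly charts above differ only in their date limits; a hypothetical helper in the same style could generate any quarter on demand:
plot_flow_quarter <- function(df, from, to) {
  # daily counts of co-occurring tokens per YouTuber, restricted to one quarter
  df %>%
    ggplot() +
    geom_line(aes(x = artDate.y, y = sum, colour = word.x)) +
    scale_x_date(limits = as.Date(c(from, to))) +
    labs(x = "時間", y = "流量", colour = "YouTuber")
}
# e.g. plot_flow_quarter(result_all, "2020-01-01", "2020-03-31")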

Part 3: Sentiment Analysis

Because this part takes each article's pushes (推) and boos (噓) into account, we lightly reshape the data and tokenize it again.

Set up the sentiment dictionaries
P <- read_file("positive.txt") # positive-word dictionary (txt file)
N <- read_file("negative.txt") # negative-word dictionary (txt file)

# each dictionary file is read in as one long string
typeof(P)
## [1] "character"
typeof(N)
## [1] "character"
Split the dictionary strings into word vectors
# Split each string on commas
# strsplit returns a list; take its first element
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]

# Build data frames with columns word and sentiment
P = data.frame(word = P, sentiment = "positive") # 664 words
N = data.frame(word = N, sentiment = "negative") # 1047 words

# Combine the two dictionaries into one
LIWC = rbind(P, N)

# Inspect the result
head(LIWC)
##       word sentiment
## 1     一流  positive
## 2 下定決心  positive
## 3 不拘小節  positive
## 4   不費力  positive
## 5     不錯  positive
## 6     主動  positive
Aggregate article pushes and boos

Pushes (推) and boos (噓) serve as proxies for readers' positive and negative sentiment toward an article.

ytr <- ytr%>%
  mutate(emo=(push-boo))
Categorize each article by the YouTuber named in its title (an article whose title names several YouTubers appears once per name)
cate <- ytr%>%
  unnest_tokens(cate,artTitle,token = chi_tokenizer)%>%
  select(artUrl,cate)%>% 
  filter(cate %in% ytr_lexicon)

ytr <- ytr%>%
  inner_join(cate)
## Joining, by = "artUrl"
ytr_after_jieba <- ytr %>%
  unnest_tokens(word, sentence, token=chi_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9]")) )


ytr_after_jieba
## # A tibble: 456,114 x 12
##    artTitle  artDate    artTime  artUrl  artPoster artCat commentNum  push   boo
##    <chr>     <date>     <time>   <chr>   <chr>     <chr>       <dbl> <dbl> <dbl>
##  1 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  2 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  3 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  4 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  5 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  6 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  7 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  8 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
##  9 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
## 10 [新聞]3萬6手~ 2019-12-31 16:32:46 https:~ LoveMake~ Gossi~        146    82    16
## # ... with 456,104 more rows, and 3 more variables: emo <dbl>, cate <chr>,
## #   word <chr>
Join the sentiment dictionary and compute daily sentiment
sentiment <- ytr_after_jieba %>%
  select(artDate,word,cate) %>%
  inner_join(LIWC) %>%
  mutate(senti_count = ifelse(sentiment =="positive",1,-1))%>%
  select(artDate,senti_count,cate) %>%
  group_by(artDate,cate)%>%
  summarise(sum = sum(senti_count))
## Joining, by = "word"
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
Visualizing the sentiment results
all_season <- sentiment%>%
  filter(cate %in% c("愛莉莎莎","館長","蒼藍鴿","蔡阿嘎","黃氏兄弟"))%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))


all_season

The chart above shows notably sharp swings for 黃氏兄弟 in 2020 Q2, 蒼藍鴿 in 2021 Q1, 愛莉莎莎 in 2020 Q2 and 2021 Q1, and 蔡阿嘎 and 館長 in 2020 Q4, so below we re-plot the sentiment results for each of those quarters:

### Individual YouTubers
#### 黃氏兄弟:
one_season <- sentiment%>%
  filter(cate %in% "黃氏兄弟")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))

#### 蒼藍鴿:
two_season <- sentiment%>%
  filter(cate %in% "蒼藍鴿")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2021-01-01','2021-03-31')))

#### 愛莉莎莎
three_season_1 <- sentiment%>%
  filter(cate %in% "愛莉莎莎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2021-01-01','2021-03-31')))

three_season_2 <- sentiment%>%
  filter(cate %in% "愛莉莎莎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))

#### 蔡阿嘎
four_season <- sentiment%>%
  filter(cate %in% "蔡阿嘎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))

#### 館長
five_season <- sentiment%>%
  filter(cate %in% "館長")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))
one_season
## Warning: Removed 5 row(s) containing missing values (geom_path).

two_season
## Warning: Removed 1 row(s) containing missing values (geom_path).

three_season_1
## Warning: Removed 335 row(s) containing missing values (geom_path).

three_season_2
## Warning: Removed 351 row(s) containing missing values (geom_path).

four_season
## Warning: Removed 136 row(s) containing missing values (geom_path).

five_season
## Warning: Removed 343 row(s) containing missing values (geom_path).

Compute sentiment scores from pushes and boos
ytr_emo <- ytr%>%
  select(artDate,cate,emo)%>%
  group_by(artDate,cate)%>%
  summarise(sum = sum(emo))
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
Visualizing the sentiment scores
### Overall
all_year_emo <- ytr_emo%>%
  filter(cate %in% c("愛莉莎莎","館長","蒼藍鴿","蔡阿嘎","黃氏兄弟"))%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))
### Individual YouTubers
#### 黃氏兄弟:
one_emo_season <- ytr_emo%>%
  filter(cate %in% "黃氏兄弟")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))

#### 蒼藍鴿:
two_emo_season <- ytr_emo%>%
  filter(cate %in% "蒼藍鴿")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2021-01-01','2021-03-31')))

#### 愛莉莎莎
three_emo_season_1 <- ytr_emo%>%
  filter(cate %in% "愛莉莎莎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2021-01-01','2021-03-31')))

three_emo_season_2 <- ytr_emo%>%
  filter(cate %in% "愛莉莎莎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-04-01','2020-06-30')))

#### 蔡阿嘎
four_emo_season <- ytr_emo%>%
  filter(cate %in% "蔡阿嘎")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))

#### 館長
five_emo_season <- ytr_emo%>%
  filter(cate %in% "館長")%>%
  ggplot()+
  geom_line(aes(x = artDate,y = sum,colour = cate))+
  scale_x_date(limits = as.Date(c('2020-10-01','2020-12-31')))
one_emo_season
## Warning: Removed 6 row(s) containing missing values (geom_path).

two_emo_season
## Warning: Removed 1 row(s) containing missing values (geom_path).

three_emo_season_1
## Warning: Removed 346 row(s) containing missing values (geom_path).

three_emo_season_2
## Warning: Removed 360 row(s) containing missing values (geom_path).

four_emo_season
## Warning: Removed 148 row(s) containing missing values (geom_path).

five_emo_season
## Warning: Removed 353 row(s) containing missing values (geom_path).
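Both chart series in this part repeat one template per YouTuber and quarter; a hypothetical helper that works for sentiment and ytr_emo alike would be:
plot_cate_quarter <- function(df, who, from, to) {
  # daily sentiment line for one YouTuber within one quarter
  df %>%
    filter(cate == who) %>%
    ggplot() +
    geom_line(aes(x = artDate, y = sum, colour = cate)) +
    scale_x_date(limits = as.Date(c(from, to)))
}
# e.g. plot_cate_quarter(sentiment, "蒼藍鴿", "2021-01-01", "2021-03-31")
# e.g. plot_cate_quarter(ytr_emo, "館長", "2020-10-01", "2020-12-31")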

Part 4: Correlated-Word Analysis

Compute pairwise correlation values
ytr_cors  <- ytr_final %>%
  group_by(word) %>%
  filter(n() >= 10) %>%
  pairwise_cor(word, artUrl, sort = TRUE)
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
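pairwise_cor() computes, for each pair of words, the phi coefficient of their co-presence across articles (artUrl). A toy example on made-up rows (purely illustrative) shows the behavior:
toy <- tibble::tibble(
  doc  = c("a", "a", "b", "b", "c"),
  word = c("館長", "健身", "館長", "健身", "愛莉莎莎")
)
toy %>% pairwise_cor(word, doc)
# 館長 and 健身 appear in exactly the same documents, so their correlation is 1;
# each is perfectly anti-correlated (-1) with 愛莉莎莎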
YouTuber list after alias consolidation
gsub_ytr_lexicon <- scan(file = "youtuber.txt", what=character(),sep='\n', 
                    encoding='utf-8',fileEncoding='utf-8',quiet = T)
Plotting the correlation results
ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[1]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "相關詞", y = "Correlation")

ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[2]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "Correlation", y = "相關詞")

ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[3]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "Correlation", y = "相關詞")

ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[4]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "Correlation", y = "相關詞")

ytr_cors %>%
  filter(item1 %in% gsub_ytr_lexicon[5]) %>%
  group_by(item1) %>%
  slice_head(n=15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  labs(x = "Correlation", y = "相關詞")

Part 5: Co-occurrence Network

Continuing from the correlated-word analysis in the previous part, we draw a co-occurrence network for the five YouTubers:

ytr_cors_coocc = ytr_cors %>% filter(item1 %in% gsub_ytr_lexicon[1:5])
ytr_cors_coocc %>%
  filter(correlation>0.18) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") + # macOS font; use a locally installed CJK font to avoid the warning below
  theme_void()
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

Conclusions

  • Notably, the co-occurrence network sheds light on the 愛莉莎莎–蒼藍鴿 dispute. The 愛莉莎莎 side of the graph rarely links to 蒼藍鴿, while the 蒼藍鴿 side links to 愛莉莎莎 repeatedly. A reasonable explanation is that 愛莉莎莎 is discussed across a wide range of topics, so this single event gets diluted, whereas 蒼藍鴿 normally attracts little discussion, making the event stand out. In other words, the pattern reflects each YouTuber's baseline discussion frequency and breadth of topics, and it also suggests the two are not at the same level of popularity.
  • The sentiment line charts show large swings in discussion sentiment when an event breaks, which is entirely reasonable. Looking at what follows, however, the sentiment curve returns to normal within a short time after each event. This differs from what we originally expected: we assumed that after a major incident a YouTuber's image would stay tarnished or markedly improved, yet at least in day-to-day discussion and comments no such lasting shift appears.