第一組專案一

組員:
N074220002 陳柏翔 N074220008 林益祥 N074220010 賴軒如 N074220016 徐建德

主題：蒐集網路名人阿滴相關評論之文字分析

資料集

social_media_midterm_Di_word_count.csv(斷詞) social_media_midterm_Di_all.csv(完整)

動機與分析目的

阿滴一直是Youtube上的網路名人，所經營的阿滴英文頻道訂閱人數超過百萬人，他也以居高不下的網路聲量與正面形象積極參與公眾事務，尤其最近因疫情的關係，和其他知名人士發起募資活動想為台灣在國際間發聲，因此這樣一位深具個人特色與正面力量的網路名人，或許可以藉由文字分析進一步發現有價值的資訊供參考。

資料簡介

資料是由中山大學管理學院文字分析平台搜尋取得，下載資料為兩種，一種是原始資料之csv檔案；另一種為平台上斷詞好的csv檔案

資料區間為2017/07/07 ～ 2020/04/13八卦版資料，從2017年7月7日阿滴的頻道訂閱數破百萬，其資料數量可供分析參考，因此透過文字分析平台檢索「阿滴」關鍵字，共搜尋到443篇文章。

前置作業與載入相關套件使用

# 分析Dataset為《阿滴》單一名詞三年來的的討論文章集合  來源:PTT Gossiping Board
# 載入packages
require(dplyr)

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

require(data.table)

## Loading required package: data.table

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

require(ggplot2)

## Loading required package: ggplot2

require(ggraph)

## Loading required package: ggraph

require(gutenbergr)

## Loading required package: gutenbergr

require(htmlwidgets)

## Loading required package: htmlwidgets

require(igraph)

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

require(jiebaR)

## Loading required package: jiebaR

## Loading required package: jiebaRD

require(LDAvis)

## Loading required package: LDAvis

require(Matrix)

## Loading required package: Matrix

require(NLP)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

require(pbapply)

## Loading required package: pbapply

require(quanteda)

## Loading required package: quanteda

## Package version: 2.0.1

## Parallel computing: 2 of 4 threads used.

## See https://quanteda.io for tutorials and examples.

## 
## Attaching package: 'quanteda'

## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-

## The following object is masked from 'package:igraph':
## 
##     as.igraph

## The following object is masked from 'package:utils':
## 
##     View

require(readr)

## Loading required package: readr

require(reshape2)

## Loading required package: reshape2

## 
## Attaching package: 'reshape2'

## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

require(Rtsne)

## Loading required package: Rtsne

require(randomcoloR)

## Loading required package: randomcoloR

require(stringr)

## Loading required package: stringr

require(slam)

## Loading required package: slam

## 
## Attaching package: 'slam'

## The following object is masked from 'package:data.table':
## 
##     rollup

require(scales)

## Loading required package: scales

## 
## Attaching package: 'scales'

## The following object is masked from 'package:readr':
## 
##     col_factor

require(servr)

## Loading required package: servr

require(tidytext)

## Loading required package: tidytext

require(tidyr)

## Loading required package: tidyr

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

## The following object is masked from 'package:Matrix':
## 
##     expand

## The following object is masked from 'package:igraph':
## 
##     crossing

require(tm)

## Loading required package: tm

## 
## Attaching package: 'tm'

## The following objects are masked from 'package:quanteda':
## 
##     as.DocumentTermMatrix, stopwords

require(topicmodels)

## Loading required package: topicmodels

require(wordcloud2)

## Loading required package: wordcloud2

require(widyr)

## Loading required package: widyr

require(wordcloud)

## Loading required package: wordcloud

## Loading required package: RColorBrewer

require(webshot)

## Loading required package: webshot

# 設定讀取字典資料路徑
ROOT.DIR<- "C:/Users/Sean/Documents/20200401_project_1"

data=fread("C:/Users/Sean/Documents/20200401_project_1/1_Dataset/social_media_midterm_Di_word_count.csv",encoding = "UTF-8")
data$artDate= data$artDate %>% as.Date("%Y/%m/%d")
data<-data %>% filter(!grepl('_',word)) 
head(data, 20)

##                        artTitle    artDate  artTime
## 1  [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 2  [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 3  [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 4  [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 5  [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 6  [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 7  [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 8  [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 9  [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 10 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 11 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 12 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 13 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 14 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 15 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 16 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 17 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 18 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 19 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 20 [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
##                                                      artUrl   word count
## 1  https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   頻道    14
## 2  https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   百萬    14
## 3  https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   英文    11
## 4  https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   台灣    10
## 5  https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   阿滴     8
## 6  https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   阿莫     7
## 7  https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   這群     6
## 8  https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   阿神     6
## 9  https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   訂閱     5
## 10 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html 蔡阿嘎     5
## 11 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   結石     5
## 12 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   創辦     5
## 13 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   目前     5
## 14 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   影片     4
## 15 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   六個     4
## 16 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   大學     4
## 17 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html 僅次於     3
## 18 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   沒有     3
## 19 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   超過     2
## 20 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html   達成     2

情緒分析

準備LIWC字典(utf8)

p<-read_file(file.path (ROOT.DIR , "3_Dict/liwc_positive.txt")) 
n<-read_file(file.path (ROOT.DIR , "3_Dict/liwc_negative.txt"))
positive <- strsplit(p, "[,]")[[1]]
negative <- strsplit(n, "[,]")[[1]]
positive <- data.frame(word = positive, sentiments = "positive")
negative <- data.frame(word = negative, sentiemtns = "negative")
colnames(negative) = c("word","sentiment")
colnames(positive) = c("word","sentiment")
LIWC_ch <- rbind(positive, negative)
head(LIWC_ch, 20)

##        word sentiment
## 1      一流  positive
## 2  下定決心  positive
## 3  不拘小節  positive
## 4    不費力  positive
## 5      不錯  positive
## 6      主動  positive
## 7      乾杯  positive
## 8      乾淨  positive
## 9    了不起  positive
## 10     享受  positive
## 11     仁心  positive
## 12     仁愛  positive
## 13     仁慈  positive
## 14     仁義  positive
## 15     仁術  positive
## 16     仔細  positive
## 17     付出  positive
## 18     伴侶  positive
## 19     伶俐  positive
## 20     作品  positive

LIWC字典中有多少正面單詞和負面單詞

LIWC_ch %>% filter(sentiment %in% c("positive", "negative")) %>% count(sentiment)

## # A tibble: 2 x 2
##   sentiment     n
##   <fct>     <int>
## 1 positive    664
## 2 negative   1047

以LIWC字典判斷文集中的word屬於正面字還是負面字文集過濾詞彙，只有一個字則不列入計算

all_filter_words_lw <- data %>% filter(nchar(.$word)>1)

計算所有各詞彙的出現總數由大到小排序

all_words_count_lw <- all_filter_words_lw %>% group_by(word) %>% summarise(sum = n()) %>% arrange(desc(sum))
head(all_words_count_lw, 20) # 可依據出現詞彙於斷詞時判斷加入停用詞

## # A tibble: 20 x 2
##    word     sum
##    <chr>  <int>
##  1 阿滴     428
##  2 英文     193
##  3 台灣     171
##  4 大家     110
##  5 什麼     102
##  6 不是      96
##  7 八卦      92
##  8 沒有      92
##  9 真的      91
## 10 覺得      89
## 11 有沒有    88
## 12 現在      87
## 13 代表      83
## 14 影片      79
## 15 怎麼      64
## 16 看到      64
## 17 出來      63
## 18 是不是    60
## 19 廣告      58
## 20 很多      57

與LIWC情緒字典join，文集中的字出現在LIWC字典中是屬於positive還是negative

liwch_ch_words_count<-all_words_count_lw %>% inner_join(LIWC_ch)

## Joining, by = "word"

print(liwch_ch_words_count)

## # A tibble: 424 x 3
##    word    sum sentiment
##    <chr> <int> <fct>    
##  1 八卦     92 negative 
##  2 問題     43 negative 
##  3 支持     42 positive 
##  4 攻擊     29 negative 
##  5 希望     28 positive 
##  6 批評     28 negative 
##  7 喜歡     23 positive 
##  8 成功     21 positive 
##  9 幫忙     21 positive 
## 10 不錯     20 positive 
## # ... with 414 more rows

liwch_ch_words_count %>% filter(sentiment %in% c("positive", "negative")) %>% count(sentiment)

## # A tibble: 2 x 2
##   sentiment     n
##   <fct>     <int>
## 1 positive    220
## 2 negative    204

繪圖出以LIWC字典統計的文集情緒字數，觀察兩種情緒值的差異

liwch_ch_words_count %>%
  group_by(sentiment) %>%
  top_n(10,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = 'Heiti TC Light'))+
  coord_flip()

文字雲

liwch_ch_words_count %>% wordcloud2(size = 4, fontFamily = "微軟雅黑",color = "random-light", backgroundColor = "grey")

以LIWC情緒字典分析

統計每天的文章正面字的次數與負面字的次數

找出所有在時間區段中的日期

all_dates <- 
  expand.grid(seq(as.Date(min(data$artDate)), as.Date(max(data$artDate)), by="day"), c("positive", "negative"))

plot_table<-data %>%
  select(artDate,word,count) %>%
  inner_join(LIWC_ch) %>% 
  group_by(artDate,sentiment) %>%
  summarise(count=sum(count))

## Joining, by = "word"

沒有資料的日期<NA>將count設為0

plot_table <- all_dates %>% 
  merge(plot_table, all.x=T,all.y=T) %>% 
  mutate(count = replace_na(count, 0))
plot_table %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%Y/%m/%d"))

情緒分析(續)

準備NTUSD字典(utf8)

pn<-read_file(file.path (ROOT.DIR , "3_Dict/ntusd_positive.txt"))
nn<-read_file(file.path (ROOT.DIR , "3_Dict/ntusd_negative.txt"))
ps<-strsplit(pn, "[\n]")[[1]]
positive_ntusd<-strsplit(ps, "[\r]")
ns<-strsplit(nn, "[\n]")[[1]]
negative_ntusd<-strsplit(ns, "[\r]")
# 用unlist拆分list後重構矩陣然後轉換為dataframe
positive_ntusd<-data.frame(matrix(unlist(positive_ntusd), nrow=2812, ncol=1, byrow=F),sentiments="positive", stringsAsFactors=FALSE)
colnames(positive_ntusd)<-c("word", "sentiment")
negative_ntusd<-data.frame(matrix(unlist(negative_ntusd), nrow=8276, ncol=1, byrow=F), sentiments="negative", stringsAsFactors = FALSE)
colnames(negative_ntusd)<-c("word","sentiment")
NTUSD_ch<-rbind(positive_ntusd, negative_ntusd)
head(NTUSD_ch, 20)

##          word sentiment
## 1    一帆風順  positive
## 2  一帆風順的  positive
## 3        一流  positive
## 4        一致  positive
## 5      一致的  positive
## 6      了不起  positive
## 7    了不起的  positive
## 8        了解  positive
## 9        人性  positive
## 10     人性的  positive
## 11   人格高尚  positive
## 12 人格高尚的  positive
## 13       人情  positive
## 14     人情味  positive
## 15       入神  positive
## 16     入神的  positive
## 17       入迷  positive
## 18     入迷的  positive
## 19       上好  positive
## 20     上好的  positive

NTUSD字典中有多少正面單詞和負面單詞

NTUSD_ch %>% filter(sentiment %in% c("positive", "negative")) %>% count(sentiment)

## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   8276
## 2 positive   2812

與NTUSD情緒字典join，文集中的字出現在NTUSD字典中是屬於positive還是negative

# 文集過濾詞彙，只有一個字則不列入計算
all_filter_words_nt <- data %>% filter(nchar(.$word)>1)

all_words_count_nt <- all_filter_words_nt %>% group_by(word) %>% summarise(sum = n()) %>% arrange(desc(sum))
head(all_words_count_nt, 20) # 可依據出現詞彙於斷詞時判斷加入停用詞

## # A tibble: 20 x 2
##    word     sum
##    <chr>  <int>
##  1 阿滴     428
##  2 英文     193
##  3 台灣     171
##  4 大家     110
##  5 什麼     102
##  6 不是      96
##  7 八卦      92
##  8 沒有      92
##  9 真的      91
## 10 覺得      89
## 11 有沒有    88
## 12 現在      87
## 13 代表      83
## 14 影片      79
## 15 怎麼      64
## 16 看到      64
## 17 出來      63
## 18 是不是    60
## 19 廣告      58
## 20 很多      57

ntusd_ch_words_count<-all_words_count_nt %>% inner_join(NTUSD_ch)

## Joining, by = "word"

print(ntusd_ch_words_count)

## # A tibble: 1,139 x 3
##    word    sum sentiment
##    <chr> <int> <chr>    
##  1 不是     96 negative 
##  2 沒有     92 negative 
##  3 沒有     92 negative 
##  4 很多     57 positive 
##  5 知道     49 positive 
##  6 問題     43 negative 
##  7 問題     43 negative 
##  8 不知     35 negative 
##  9 不知     35 negative 
## 10 當然     31 positive 
## # ... with 1,129 more rows

ntusd_ch_words_count %>% filter(sentiment %in% c("positive", "negative")) %>% count(sentiment)

## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative    684
## 2 positive    455

繪圖出以NTUSD字典統計的文集情緒字數，觀察兩種情緒值的差異

ntusd_ch_words_count %>%
  group_by(sentiment) %>%
  top_n(10,wt = sum) %>%
  ungroup() %>% 
  mutate(word = reorder(word, sum)) %>%
  ggplot(aes(word, sum, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  theme(text=element_text(size=14,family = 'Heiti TC Light'))+
  coord_flip()

文字雲

# ntusd_ch_words_count[-1:-5,] %>% wordcloud2(size = 4, fontFamily = "微軟雅黑",color = "random-light", backgroundColor = "grey") # row1~row5詞彙去除
# 文字雲只能出現一次，其餘使用以下指令顯示
knitr::include_graphics(file.path(ROOT.DIR, "2_Rcode/NTUSD_WordCloud.png"))

以NTUSD情緒字典分析

統計每天的文章正面字的次數與負面字的次數

找出所有在時間區段中的日期

all_dates <- 
  expand.grid(seq(as.Date(min(data$artDate)), as.Date(max(data$artDate)), by="day"), c("positive", "negative"))

plot_table<-data %>%
  select(artDate,word,count) %>%
  inner_join(NTUSD_ch) %>% 
  group_by(artDate,sentiment) %>%
  summarise(count=sum(count))

## Joining, by = "word"

沒有資料的日期<NA>將count設為0

plot_table <- all_dates %>% 
  merge(plot_table, all.x=T,all.y=T) %>% 
  mutate(count = replace_na(count, 0))
plot_table %>%
  ggplot()+
  geom_line(aes(x=artDate,y=count,colour=sentiment))+
  scale_x_date(labels = date_format("%Y/%m/%d"))

比較兩個字典分析差異

senti_by_date_liwc<-data %>% 
  inner_join(LIWC_ch) %>%
  
  group_by(artDate, sentiment) %>%
  summarise(n=sum(count)) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  mutate(method='LIWC')

## Joining, by = "word"

senti_by_date_ntusd<-data %>% 
  inner_join(NTUSD_ch) %>%
  group_by(artDate, sentiment) %>%
  summarise(n=sum(count)) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  mutate(method='NTUSD')

## Joining, by = "word"

比較兩個字典分析差異

bind_rows(senti_by_date_liwc,
          senti_by_date_ntusd) %>%
  
  filter(artDate>='2020-01-01') %>%
  ggplot(aes(x= artDate,y=sentiment,fill=method)) +
  
  geom_col(show.legend = FALSE) +
  scale_x_date(labels = date_format("%m/%d")) +
  facet_wrap(~method, ncol = 1, scales = "fixed")+
  geom_text(aes(label=sentiment))

「阿滴」的討論熱度折線圖顯示

data2=fread('C:/Users/Sean/Documents/20200401_project_1/1_Dataset/social_media_midterm_Di_all.csv',encoding = 'UTF-8')
data2$artDate<-data2$artDate %>% as.Date("%Y/%m/%d")
head(data2, 5)

##                                      artTitle    artDate  artTime
## 1:               [爆卦]阿滴英文突破百萬訂閱！ 2017-07-07 05:45:25
## 2:         [問卦]阿滴英文少了滴妹還剩下什麼？ 2017-07-08 07:22:49
## 3:                       [問卦]阿滴算帥哥嗎？ 2017-07-22 11:27:29
## 4: [新聞]阿滴英文彈琴帥炸！　滴妹飆唱《魚仔》 2017-07-23 00:18:37
## 5:                  [問卦]阿滴跟滴妹要選哪個? 2017-07-23 21:00:07
##                                                      artUrl    artPoster
## 1: https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html gamania10000
## 2: https://www.ptt.cc/bbs/Gossiping/M.1499527731.A.E6B.html     xxxxxmay
## 3: https://www.ptt.cc/bbs/Gossiping/M.1500752013.A.6B4.html    pengjoker
## 4: https://www.ptt.cc/bbs/Gossiping/M.1500798279.A.78A.html       Digo99
## 5: https://www.ptt.cc/bbs/Gossiping/M.1500872815.A.CF4.html     yabe5566
##       artCat commentNum push boo
## 1: Gossiping         81   42  16
## 2: Gossiping         14    5   1
## 3: Gossiping         30   12   2
## 4: Gossiping         28    9  10
## 5: Gossiping         28   19   1
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           sentence
## 1: 各位大德晚安\n\n我國少見的清流頻道．阿滴英文（Ray Du English）\n於今日晚間已經確定成為\n台灣第六個訂閱人數超過百萬的Youtube頻道\n也是今年第四個\n\n　　1. 蔡阿嘎　　(2014.07.09)　　　146萬　 （依達成順序排列）\n　　2. 這群人　　(2016.08.30)      180萬\n　　3. 阿神　　　(2017.03.13)      114萬\n　　4. 谷阿莫　　(2017.04.27)      108萬\n　　5. 聖結石　　(2017.05.29)      115萬\n　　6.\n阿滴英文　(2017.07.07)      100萬\n相關資料如下：\n\n 一、阿滴英文是台灣第一個教育類百萬級頻道\n　 　但是頻道內點閱數最高的那部影片，跟英文／教育完全沒關係\n\n 二、阿滴英文創辦於2015年1月11日\n     歷經908天後達成百萬訂閱（十萬訂閱花了456天／2016年4月11日）\n     是台灣史上第三快的（僅次於聖結石220天與谷阿莫625天）\n\n 三、阿滴英文目前有189部影片，在百萬級頻道中只比這群人多\n 　  目前發表最多影片的是阿神（2835部），數量約是阿滴的15倍，這群人的22倍\n\n 四、雖然阿滴英文的訂閱數超過百萬，但是觀看次數不到五千萬\n 　  只有聖結石的四分之一，谷阿莫的七分之一\n　 　在台灣甚至排不到三十名\n\n 五、阿滴英文是台灣第一個由金牛座創辦的百萬級頻道（1989年5月8日晚上9點）\n     （這群人的尼克生於1988年4月20日，但那年太陽落於雙子座）\n　　 而且目前台灣的高訂閱youtuber中，幾乎沒有金牛座\n\n　　 蔡阿嘎（巨蟹）\n　　 這群人（摩羯／水瓶x2／雙魚／牡羊／雙子／巨蟹／獅子／處女）\n　　 阿　神（天秤）\n　　 谷阿莫（摩羯）\n　　 聖結石（天蠍）\n\n 六、六個百萬級頻道中，只有蔡阿嘎跟谷阿莫從未跟阿滴英文合作過影片\n     不過阿滴採訪過蔡阿嘎，發出合照的當晚IG立刻遭刪除\n\n 七、目前六個百萬級頻道的創辦人\n　　 都不是公立大學出身\n\n     蔡阿嘎　　東吳大學、中國文化大學\n　   這群人　　國立台灣戲曲學院\n　   阿神　　　明志科技大學\n     谷阿莫    龍華科技大學\n     聖結石    南強高級工商職業學校\n     阿滴      輔仁大學 　\n\n 八、阿滴英文是台灣第二個由碩士創辦的百萬級頻道\n　　 僅次於蔡阿嘎\n\n 九、阿滴英文是台灣第二個由同志天菜創辦的百萬級頻道\n     僅次於阿神\n\n 十、六個百萬級頻道中\n     沒有任何一個創辦人有近視（阿滴與滴妹曾經有但已雷射）\n     而除了阿滴、滴妹外，阿神也常戴無鏡框眼鏡\n\n 十一、六個百萬級頻道中\n　   　只有谷阿莫沒有在台灣辦過粉絲活動（見面會、簽名會...等）\n\n 十二、六個百萬級頻道中\n　　   只有阿神還未發表任何出版作品（書、單曲、電影...等）\n\n 十三、台灣目前還沒有任何一個由女性單獨創辦的百萬級頻道\n
## 2:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         如題\n少了滴妹的阿滴\n還剩下什麼.\n.\n.\n.\n.\n剩阿滴\n.\n.\n.\n.\n欸這是廢話\n我是問還剩下那些內容可看\n大家覺得呢？\n
## 3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                如題\n\n就是阿滴英文那個阿滴\n\n各位大大覺得他算帥哥嗎？\nhttp://i.imgur.com/rZxhy7z.jpg\n有沒有八卦？？？\n\n\n
## 4:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           阿滴英文彈琴帥炸！　滴妹飆唱《魚仔》甜聲融化網友\n\n記者劉宜庭／綜合報導\n\n阿滴靠著全台最大英文教學YouTube頻道「阿滴英文」走紅，近期更發行新書，擁有破百\n萬粉絲支持。他22日晚間在臉書上開直播，難得和滴妹合體，一起彈唱《魚仔》、《Mos\nt Girls》等數首歌曲，展現彈琴功力，一旁滴妹甜美嗓音，兄妹倆搭檔呈現絕佳默契，\n令粉絲一飽耳福。\n\n「阿滴英文」22日在臉書開直播，坦言這次難得和滴妹合作彈唱，是想藉此機會紀念聯合\n樂團主唱查斯特（Chester Bennington），「對我來說是很震撼的事情，因為我從小第一\n次接觸西洋音樂，就是透過Linkin Park。」\n\n看見查斯特自殺身亡的消息，阿滴除了感嘆惋惜，同時也不忘呼籲患有憂鬱症或是其他疾\n病的人，「希望大家不要把自殺當作是一個\n選則\n，因為活著總是有出路的。」他和滴妹合\n作唱《Numb》之前，也透露這首歌是他聽「聯合公園」的第一首歌，當時甚至還為此去學\n了一段Rap，表示該曲在國、高中時期對他來說有很大的影響力。\n\n這次直播除了紀念樂團「聯合公園」，阿滴和滴妹飆唱《What I've Done》、《Numb》之\n外，兄妹倆也彈唱了許多最近接觸到的歌，包括《魚仔》、《不曾回來過》等歌曲，難得\n展現彈琴功力，模樣被讚相當帥氣，搭配滴妹甜美嗓音，令粉絲聽地如癡如醉，在翻唱《\n魚仔》的過程中，兄妹倆還逗趣用台語對話，一來一往有趣互動逗樂眾人，引發不少討論\n。\n\n● 《ETNEWS新聞雲》提醒您，請給自己機會：\n自殺防治諮詢安心專線：0800-788995；生命線協談專線：1995\nhttp://star.ettoday.net/news/972597?from=fb_et_star
## 5:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            如題啦\n\n這兩位英文都很好\n\n阿滴笑起來很可愛\n\n滴妹鼻子很Q\n\n大家會選哪個?\n\n\n

日期折線圖計算出每天文章的發表數量，檢試疫情發生的這三年來「阿滴」的討論熱度

t_date <- data2 %>% 
  select(artDate, artUrl) %>% 
  distinct()
article_count_by_date <- t_date %>% 
  group_by(artDate) %>% 
  summarise(count = n())

討論篇數最多的前十天

article_count_by_date %>% 
  arrange(desc(count))%>% 
  top_n(10)

## Selecting by count

## # A tibble: 16 x 2
##    artDate    count
##    <date>     <int>
##  1 2020-04-12   136
##  2 2020-04-11   113
##  3 2020-04-13    13
##  4 2020-04-10     7
##  5 2018-10-28     5
##  6 2020-02-17     4
##  7 2018-01-20     3
##  8 2018-01-21     3
##  9 2018-03-31     3
## 10 2018-04-06     3
## 11 2018-04-07     3
## 12 2019-02-19     3
## 13 2019-12-06     3
## 14 2019-12-27     3
## 15 2020-01-31     3
## 16 2020-02-06     3

「阿滴」討論文章數的折線圖顯示

date_plot <- article_count_by_date %>% 
  ggplot(aes(x = artDate, y = count)) +
  geom_line(color = "purple", size = 1.5) + 
  geom_vline(xintercept = c(as.numeric(as.Date("2018-10-28")),
                            as.numeric(as.Date("2020-04-10")),
                            as.numeric(as.Date("2020-04-13")),
                            as.numeric(as.Date("2020-04-11")), 
                            as.numeric(as.Date("2020-04-12"))), col='red', size = 1) + 
  scale_x_date(labels = date_format("%Y/%m/%d")) +
  ggtitle("「阿滴」討論文章數") + 
  xlab("日期") + 
  ylab("數量") +
  theme(text = element_text(family = "Heiti TC Light"))
date_plot

解釋折線圖意義:

文章討論的高點還是集中在四月份的時候，因為阿滴這時候有出現一些新聞熱度出來，所以才會有折線高點的情況，不然平時的討論度其實不高。

文字雲

觀察有關阿滴熱門討論的內容先行斷詞與停用詞使用

jieba_tokenizer <- worker(user=file.path(ROOT.DIR, "3_Dict/di_dict.txt"), stop_word = file.path(ROOT.DIR, "5_Stopwords/stop_words.txt"))
d_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    tokens <- tokens[nchar(tokens)>1]
    return(tokens)
  })
}
# 斷詞
di_tokens <- data2 %>% 
  unnest_tokens(word, sentence, token=d_tokenizer) %>% 
  select(-artTime, -artUrl)
# 詞頻
di_tokens_count <- di_tokens %>% 
  group_by(word) %>% 
  summarise(sum = n()) %>% 
  arrange(desc(sum))
head(di_tokens_count, 10) # 根據輸出詞彙判斷是否加入停用詞文本(例如:https, com等)

## # A tibble: 10 x 2
##    word       sum
##    <chr>    <int>
##  1 阿滴       878
##  2 台灣       576
##  3 英文       522
##  4 代表       187
##  5 影片       158
##  6 youtuber   153
##  7 真的       135
##  8 覺得       134
##  9 世界       127
## 10 廣告       118

移除數量多但是較無意義的詞彙(在停用詞中無效果) 例如: 「阿滴」「真的」「覺得」

# wordc_plot <- di_tokens_count %>% 
#  filter(word != c("阿滴","真的","覺得")) %>% 
#  filter(sum > 40) %>% 
#   wordcloud2(size = 4, fontFamily = "微軟雅黑",color = "random-light", backgroundColor = "grey")
#wordc_plot
# 文字雲只能出現一次，其餘使用以下指令顯示
knitr::include_graphics(file.path(ROOT.DIR, "2_Rcode/WORDC_Plot.png"))

長條圖

查看詞彙頻率狀況

di_tokens_by_date <- di_tokens %>% 
  count(artDate, word, sort = TRUE)
# 過濾掉預期會存在的詞彙(阿滴, 英文)  
plot_merge <- di_tokens_by_date %>% 
  filter(word != "阿滴" & word != "英文") %>% 
  filter(artDate == as.Date("2020-04-13") | 
         artDate == as.Date("2020-04-12") | 
         artDate == as.Date("2020-04-11") |
         artDate == as.Date("2019-08-18") | 
         artDate == as.Date("2019-07-13")) %>% 
  group_by(artDate) %>% 
  top_n(7, n) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x=word, y=n, fill = artDate)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = NULL) +
  facet_wrap(~artDate, scales="free", ncol = 2) + 
  coord_flip()+
  theme(text = element_text(family = "Heiti TC Light"))
plot_merge

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

分析詞彙和文件頻率 tf-idf

以文章區隔document

# 內容(文章、詞彙、詞頻、長度)
di_tokens_by_art <- di_tokens %>% 
  filter(!str_detect(word, regex("[0-9]"))) %>%
  count(artTitle, word, sort = TRUE)
di_total_words_by_art <- di_tokens_by_art %>% 
  group_by(artTitle) %>% 
  summarize(total = sum(n)) %>% 
  arrange(desc(total))
di_tokens_by_art <- left_join(di_tokens_by_art, di_total_words_by_art)

## Joining, by = "artTitle"

選擇文章長度大於50個詞

di_words_tf_idf <- di_tokens_by_art %>%
  bind_tf_idf(word, artTitle, n) 
di_words_tf_idf <- di_words_tf_idf %>% 
  filter(total > 50) %>% 
  arrange(desc(tf_idf))
knitr::kable(head(di_words_tf_idf, 40)) # 以kable函數來顯示以避免格式異常

artTitle	word	n	total	tf	idf	tf_idf
Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦	飲料店	5	51	0.0980392	5.817111	0.5703050
[新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了	理科	9	85	0.1058824	5.123964	0.5425374
[新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了	太太	8	85	0.0941176	5.123964	0.4822554
Re:[問卦]阿滴英文算過氣了嗎？	音標	12	149	0.0805369	5.817111	0.4684922
Re:[新聞]姪子是聖結石粉…他秒轉阿滴英文「導正	watch	6	51	0.1176471	3.178054	0.3738887
[問卦]有沒有哥倫布屌打阿滴的八卦？	哥倫布	4	53	0.0754717	4.718499	0.3561131
Re:[新聞]姪子是聖結石粉…他秒轉阿滴英文「導正	聖嫂	3	51	0.0588235	5.817111	0.3421830
Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?	爭鮮	4	70	0.0571429	5.817111	0.3324064
[新聞]阿滴開箱總統專機曝光空軍一號內裝	總統	35	300	0.1166667	2.821379	0.3291609
[問卦]阿滴是噗是妹控啊？	妹妹	8	63	0.1269841	2.559015	0.3249542
[新聞]阿滴開箱總統專機曝光空軍一號內裝	專機	16	300	0.0533333	5.817111	0.3102459
Re:[政治]阿滴募資紐時廣告楊蕙如：YouTuber可代	聖母	7	83	0.0843373	3.514526	0.2964058
Re:[問卦]阿滴英文背景的台灣中國地圖	上傳	5	66	0.0757576	3.871201	0.2932728
[新聞]台南人英語超強？阿滴、滴妹街頭實測結	台南	7	141	0.0496454	5.817111	0.2887928
[問卦]台大外文的英文能力是不是比阿滴還弱？	外文系	3	62	0.0483871	5.817111	0.2814731
[問卦]為什麼阿滴兄妹只上輔大	輔大	4	54	0.0740741	3.737670	0.2768644
Re:[新聞]姪子是聖結石粉…他秒轉阿滴英文「導正	www	6	51	0.1176471	2.290751	0.2695001
Re:[問卦]阿滴英文是靠個人魅力在撐嗎？	正派	3	65	0.0461538	5.817111	0.2684821
Re:[問卦]阿滴英文背景的台灣中國地圖	街訪	3	66	0.0454545	5.817111	0.2644141
Re:[新聞]姪子是聖結石粉…他秒轉阿滴英文「導正	聖結石	3	51	0.0588235	4.430817	0.2606363
[問卦]阿滴是不是最強YouTuber？	最強	3	52	0.0576923	4.430817	0.2556240
Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?	阿土	3	70	0.0428571	5.817111	0.2493048
[ＦＢ]阿滴談清流youtuber	gl	24	463	0.0518359	4.718499	0.2445874
[ＦＢ]阿滴談清流youtuber	goo	24	463	0.0518359	4.718499	0.2445874
Re:[新聞]姪子是聖結石粉…他秒轉阿滴英文「導正	youtube	6	51	0.1176471	2.055911	0.2418719
Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦	愚人節	3	51	0.0588235	4.025352	0.2367854
Re:[問卦]阿滴英文算過氣了嗎？	kk	6	149	0.0402685	5.817111	0.2342461
Re:[問卦]為啥阿滴突然被用打柯模式來洗版文狂罵啊?	roc	5	125	0.0400000	5.817111	0.2326844
[問卦]阿滴什麼時候停止疑似斂財的行為？	板民	3	75	0.0400000	5.817111	0.2326844
Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦	飲品	2	51	0.0392157	5.817111	0.2281220
[問卦]大謙和阿滴創新的頻道會叫什？	本魯	3	77	0.0389610	5.817111	0.2266407
[問卦]有沒有阿滴洗的比Jolin還多篇的八卦？	一下子	3	68	0.0441176	5.123964	0.2260572
[問卦]有沒有阿滴的英文程度如何的八卦??	西班牙文	2	52	0.0384615	5.817111	0.2237350
Re:[問卦]阿滴說不小心多募款了1500萬會好好使用	義呆利	2	52	0.0384615	5.817111	0.2237350
Re:[問卦]為何要給阿滴如此大的苛責？	賺錢	4	65	0.0615385	3.619887	0.2227623
Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?	好吃	3	70	0.0428571	5.123964	0.2195985
[問卦]有沒有哥倫布屌打阿滴的八卦？	哦哦哦	2	53	0.0377358	5.817111	0.2195136
[問卦]台大外文的英文能力是不是比阿滴還弱？	台大	3	62	0.0483871	4.430817	0.2143944
[問卦]所以阿滴真的要登了？？？？	歷史	2	55	0.0363636	5.817111	0.2115313
[問卦]阿滴這次的募資文案將窒礙難行的原因	執政	3	83	0.0361446	5.817111	0.2102570

tf-idf最高的文章本文:

結果真的要開啦！不是愚人節唬爛的滴妹跟阿滴在頻道上傳新影片提到真的要開飲料店，愚人節企劃不是唬爛，飲料店約4月底開幕，而且飲料店籌備將近一年，滴妹全程參與，包括飲品，不過飲料店裡面將不會販售愚人節影片提到的無熱量透明奶茶，只是給滴妹研發飲品，這間店的甜度應該會比一般飲料店甜很多吧

以日期區隔document

# 內容(日期、詞彙、詞頻、長度)
di_tokens_by_date <- di_tokens %>% 
  filter(!str_detect(word, regex("[0-9]"))) %>%
  count(artDate, word, sort = TRUE)
di_total_words_by_date <- di_tokens_by_date %>% 
  group_by(artDate) %>% 
  summarize(total = sum(n)) %>% 
  arrange(desc(total))
di_tokens_by_date <- left_join(di_tokens_by_date, di_total_words_by_date)

## Joining, by = "artDate"

選擇文章長度大於50個詞

di_words_tf_idf_date <- di_tokens_by_date %>%
  bind_tf_idf(word, artDate, n) 
di_words_tf_idf_date<- di_words_tf_idf_date %>% 
  filter(total > 50) %>% filter(!str_detect(word, regex("[a-z]")))  %>% group_by(artDate) %>% top_n(1) %>% arrange(artDate) # word出現異常英文詞彙先行過濾

## Selecting by tf_idf

di_words_tf_idf_date

## # A tibble: 63 x 7
## # Groups:   artDate [58]
##    artDate    word         n total     tf   idf tf_idf
##    <date>     <chr>    <int> <int>  <dbl> <dbl>  <dbl>
##  1 2017-07-07 谷阿莫       7   235 0.0298  4.86 0.145 
##  2 2017-07-23 彈唱         3   193 0.0155  4.86 0.0755
##  3 2017-07-23 彈琴         3   193 0.0155  4.86 0.0755
##  4 2017-07-31 頻道        32   476 0.0672  1.30 0.0877
##  5 2017-08-08 聖結石      12   261 0.0460  3.76 0.173 
##  6 2017-08-19 妹妹         8    63 0.127   1.82 0.231 
##  7 2017-09-06 西班牙文     2    52 0.0385  4.86 0.187 
##  8 2018-01-19 片語         3    66 0.0455  3.47 0.158 
##  9 2018-01-20 滿強         3   127 0.0236  4.86 0.115 
## 10 2018-01-21 系列         3   109 0.0275  3.25 0.0895
## # ... with 53 more rows

找出前面五個日期討論高點

# 內容(日期、詞彙、詞頻、長度、tf、idf、tf_idf)
di_words_tf_idf_date %>% arrange(desc(n))

## # A tibble: 63 x 7
## # Groups:   artDate [58]
##    artDate    word       n total     tf   idf tf_idf
##    <date>     <chr>  <int> <int>  <dbl> <dbl>  <dbl>
##  1 2020-04-11 台灣     277  9422 0.0294  1.56 0.0460
##  2 2020-04-12 台灣     211  8400 0.0251  1.56 0.0393
##  3 2019-07-13 總統      35   300 0.117   2.91 0.340 
##  4 2017-07-31 頻道      32   476 0.0672  1.30 0.0877
##  5 2020-04-10 集資      22   853 0.0258  3.47 0.0896
##  6 2019-12-15 教育      15   291 0.0515  2.78 0.143 
##  7 2017-08-08 聖結石    12   261 0.0460  3.76 0.173 
##  8 2019-02-11 音標      12   149 0.0805  4.86 0.391 
##  9 2020-04-13 廣告      12   984 0.0122  3.47 0.0424
## 10 2019-12-06 勞動局    11   492 0.0224  4.17 0.0932
## # ... with 53 more rows

di_words_tf_idf_date %>% 
  filter(total > 50) %>% 
  filter(artDate == as.Date("2020-04-12") | 
         artDate == as.Date("2020-04-10") | 
         artDate == as.Date("2019-12-15") |
         artDate == as.Date("2019-07-13") | 
         artDate == as.Date("2017-07-31")) %>% 
  group_by(artDate) %>%  
  top_n(1) %>% 
  arrange(desc(artDate))

## Selecting by tf_idf

## # A tibble: 5 x 7
## # Groups:   artDate [5]
##   artDate    word      n total     tf   idf tf_idf
##   <date>     <chr> <int> <int>  <dbl> <dbl>  <dbl>
## 1 2020-04-12 台灣    211  8400 0.0251  1.56 0.0393
## 2 2020-04-10 集資     22   853 0.0258  3.47 0.0896
## 3 2019-12-15 教育     15   291 0.0515  2.78 0.143 
## 4 2019-07-13 總統     35   300 0.117   2.91 0.340 
## 5 2017-07-31 頻道     32   476 0.0672  1.30 0.0877

以下為討論高點日期內的文章 2020/4/12 : [新聞]楊蕙如質疑阿滴「Youtuber可代表台灣罵人(137篇文章) 2020/4/10 : [新聞]阿滴集資400萬達標！(7篇文章)　 2019/12/15 : [新聞]網紅跨界亞洲教育攜手阿滴英文開實體補習班(1篇文章) 2019/7/13 : [新聞]阿滴開箱總統專機曝光空軍一號內裝(1篇文章) 2017/7/31 : [ＦＢ]阿滴談清流youtuber(2篇文章)

前後三個字彙

找出常出現在「阿滴」附近的字根據n-1個item來預測第n個item

ngram_7 <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    ngram <- ngrams(tokens, 7)
    ngram <- lapply(ngram, paste, collapse = " ")
    unlist(ngram)
  })
}

di_ngram_7 <- data2 %>%
  select(artUrl, sentence) %>%
  unnest_tokens(ngram, sentence, token = ngram_7) %>%
  filter(!str_detect(ngram, regex("[0-9a-zA-Z]")))
di_ngrams_7_separated <- di_ngram_7 %>%
  separate(ngram, paste0("word", c(1:7),sep=""), sep = " ")
head(di_ngrams_7_separated)

##                                                     artUrl word1 word2
## 1 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html  大德  晚安
## 2 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html  晚安  我國
## 3 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html  於今    日
## 4 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html    日  晚間
## 5 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html  晚間  已經
## 6 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html  已經  確定
##   word3 word4  word5  word6  word7
## 1  我國  少見   清流   頻道     滴
## 2  少見  清流   頻道     滴   英文
## 3  晚間  已經   確定   成為   台灣
## 4  已經  確定   成為   台灣 第六個
## 5  確定  成為   台灣 第六個   訂閱
## 6  成為  台灣 第六個   訂閱   人數

di_check_words <- di_ngrams_7_separated %>%
  filter((word4 == "阿滴"))
head(di_check_words)

##                                                     artUrl word1    word2
## 1 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html    中   蔡阿嘎
## 2 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html  英文     合作
## 3 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html  高級 工商職業
## 4 https://www.ptt.cc/bbs/Gossiping/M.1499435488.A.64E.html  曾經       已
## 5 https://www.ptt.cc/bbs/Gossiping/M.1499527731.A.E6B.html  如題       少
## 6 https://www.ptt.cc/bbs/Gossiping/M.1500752013.A.6B4.html  如題     阿滴
##    word3 word4    word5  word6 word7
## 1 谷阿莫  阿滴     英文   合作  影片
## 2   影片  阿滴     採訪   過蔡  阿嘎
## 3   學校  阿滴 輔仁大學     滴  英文
## 4   雷射  阿滴       滴     妹    外
## 5   滴妹  阿滴     剩下 剩阿滴    欸
## 6   英文  阿滴     覺得   他算  帥哥

di_check_words_count <- di_check_words %>%
  melt(id.vars = "artUrl", measure.vars = paste0("word", c(1:7),sep="")) %>%
  rename(word=value) %>%
  filter(variable!="word4") %>%
  filter(!(word %in% stop_words), nchar(word)>1) %>%
  count(word, sort = TRUE)
di_check_words_count %>%
  arrange(desc(abs(n))) %>%
  head(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = n > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("出現在「阿滴」附近的字") +
  ylab("出現次數") +
  coord_flip()+ 
  theme(text = element_text(family = "Heiti TC Light"))

Word Correlation(詞彙之間的相關性)

di_words_by_art <- data2 %>%
  unnest_tokens(word, sentence, token=d_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9]"))) %>%
  count(artUrl, word, sort = TRUE)
di_word_pairs <- di_words_by_art %>%
  pairwise_count(word, artUrl, sort = TRUE)
di_word_pairs

## # A tibble: 2,169,152 x 3
##    item1 item2     n
##    <chr> <chr> <dbl>
##  1 阿滴  英文    196
##  2 英文  阿滴    196
##  3 阿滴  台灣    165
##  4 台灣  阿滴    165
##  5 真的  阿滴     90
##  6 阿滴  真的     90
##  7 覺得  阿滴     89
##  8 八卦  阿滴     89
##  9 阿滴  覺得     89
## 10 阿滴  八卦     89
## # ... with 2,169,142 more rows

di_word_cors <- di_words_by_art %>%
  group_by(word) %>%
  filter(n() >= 50) %>%
  pairwise_cor(word, artUrl, sort = TRUE)
di_word_cors %>%
  filter(item1 == "阿滴")

## # A tibble: 20 x 3
##    item1 item2    correlation
##    <chr> <chr>          <dbl>
##  1 阿滴  廣告         0.0703 
##  2 阿滴  真的         0.0601 
##  3 阿滴  youtuber     0.0555 
##  4 阿滴  影片         0.0523 
##  5 阿滴  英文         0.0355 
##  6 阿滴  應該         0.0320 
##  7 阿滴  覺得         0.0282 
##  8 阿滴  有沒有       0.0255 
##  9 阿滴  募款         0.0249 
## 10 阿滴  知道         0.0109 
## 11 阿滴  八卦        -0.00274
## 12 阿滴  現在        -0.00688
## 13 阿滴  根本        -0.0101 
## 14 阿滴  一堆        -0.0114 
## 15 阿滴  代表        -0.0123 
## 16 阿滴  台灣        -0.0155 
## 17 阿滴  滴妹        -0.0156 
## 18 阿滴  看到        -0.0343 
## 19 阿滴  是不是      -0.0415 
## 20 阿滴  美國        -0.0509

詞彙之間相關性轉成共現圖

seed_words <- c("新聞", "綜合", "appledaily")
threshold <- 0.65
remove_words <- di_word_cors %>%
                filter((item1 %in% seed_words|item2 %in% seed_words), correlation>threshold) %>%
                .$item1 %>%
                unique()
set.seed(2017)
di_word_cors_new <- di_word_cors %>%
                filter(!(item1 %in% remove_words|item2 %in% remove_words))
di_word_cors_new %>%
  filter(correlation > .03) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") +
  theme_void()

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

term_avg_tfidf <- di_words_tf_idf %>% 
  group_by(word) %>% 
  summarise(tfidf_avg = mean(tf_idf))
term_avg_tfidf$tfidf_avg %>% summary

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0009649 0.0120412 0.0226347 0.0315110 0.0365856 0.5703050

term_remove=term_avg_tfidf %>%  
  filter(tfidf_avg<0.0325254) %>% 
  .$word
term_remove %>% head

## [1] "<U+5218><U+754A>宏" "<U+5218>以" "<U+5F20><U+94A7>" "<U+5F20>惠妹" "<U+8D3A><U+519B>翔" "<U+8D56>冠霖"

di_dtm = di_words_tf_idf %>%
  filter(!word %in% term_remove) %>%
  cast_dtm(document=artTitle,term=word,value= n)

di_dtm

## <<DocumentTermMatrix (documents: 117, terms: 2088)>>
## Non-/sparse entries: 3582/240714
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)

di_dtm_matrix = di_dtm %>% as.data.frame.matrix 
di_dtm_matrix[1:10,1:20]

##                                                 飲料店 理科 太太 音標
## Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦                  5    0    0    0
## [新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了           0    9    8    0
## Re:[問卦]阿滴英文算過氣了嗎？                        0    0    0   12
## Re:[新聞]姪子是聖結石粉...他秒轉阿滴英文「導正       0    0    0    0
## [問卦]有沒有哥倫布屌打阿滴的八卦？                   0    0    0    0
## Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?      0    0    0    0
## [新聞]阿滴開箱總統專機曝光空軍一號內裝               0    0    0    0
## [問卦]阿滴是噗是妹控啊？                             0    0    0    0
## Re:[政治]阿滴募資紐時廣告楊蕙如：YouTuber可代        0    0    0    0
## Re:[問卦]阿滴英文背景的台灣中國地圖                  0    0    0    0
##                                                 watch 哥倫布 聖嫂 爭鮮
## Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦                 2      0    0    0
## [新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了          0      0    0    0
## Re:[問卦]阿滴英文算過氣了嗎？                       0      0    0    0
## Re:[新聞]姪子是聖結石粉...他秒轉阿滴英文「導正      6      0    3    0
## [問卦]有沒有哥倫布屌打阿滴的八卦？                  0      4    0    0
## Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?     0      0    0    4
## [新聞]阿滴開箱總統專機曝光空軍一號內裝              0      0    0    0
## [問卦]阿滴是噗是妹控啊？                            0      0    0    0
## Re:[政治]阿滴募資紐時廣告楊蕙如：YouTuber可代       0      0    0    0
## Re:[問卦]阿滴英文背景的台灣中國地圖                 0      0    0    0
##                                                 總統 妹妹 專機 聖母 上傳
## Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦                0    0    0    0    1
## [新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了         0    2    0    0    0
## Re:[問卦]阿滴英文算過氣了嗎？                      0    0    0    0    0
## Re:[新聞]姪子是聖結石粉...他秒轉阿滴英文「導正     0    0    0    0    0
## [問卦]有沒有哥倫布屌打阿滴的八卦？                 0    0    0    0    0
## Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?    0    0    0    0    0
## [新聞]阿滴開箱總統專機曝光空軍一號內裝            35    0   16    0    0
## [問卦]阿滴是噗是妹控啊？                           0    8    0    0    0
## Re:[政治]阿滴募資紐時廣告楊蕙如：YouTuber可代      0    0    0    7    0
## Re:[問卦]阿滴英文背景的台灣中國地圖                0    0    0    0    5
##                                                 台南 外文系 輔大 正派 街訪
## Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦                0      0    0    0    0
## [新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了         0      0    0    0    0
## Re:[問卦]阿滴英文算過氣了嗎？                      0      0    0    0    0
## Re:[新聞]姪子是聖結石粉...他秒轉阿滴英文「導正     0      0    0    0    0
## [問卦]有沒有哥倫布屌打阿滴的八卦？                 0      0    0    0    0
## Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?    0      0    0    0    0
## [新聞]阿滴開箱總統專機曝光空軍一號內裝             0      0    0    0    0
## [問卦]阿滴是噗是妹控啊？                           0      0    0    0    0
## Re:[政治]阿滴募資紐時廣告楊蕙如：YouTuber可代      0      0    0    0    0
## Re:[問卦]阿滴英文背景的台灣中國地圖                0      0    0    0    3
##                                                 聖結石 最強
## Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦                  0    0
## [新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了           0    0
## Re:[問卦]阿滴英文算過氣了嗎？                        0    0
## Re:[新聞]姪子是聖結石粉...他秒轉阿滴英文「導正       3    0
## [問卦]有沒有哥倫布屌打阿滴的八卦？                   0    0
## Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?      0    0
## [新聞]阿滴開箱總統專機曝光空軍一號內裝               0    0
## [問卦]阿滴是噗是妹控啊？                             0    0
## Re:[政治]阿滴募資紐時廣告楊蕙如：YouTuber可代        0    0
## Re:[問卦]阿滴英文背景的台灣中國地圖                  0    0

library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

clust = makeCluster(detectCores())
registerDoParallel(clust); getDoParWorkers()

## [1] 4

t0 = Sys.time()
d = di_dtm_matrix %>%
  dist(method="euclidean")  # 歐式距離，算文章與文章之間的距離
Sys.time() - t0

## Time difference of 0.1536241 secs

hc = hclust(d, method='ward.D')  
plot(hc, labels = FALSE, xlab = NULL)
rect.hclust(hc, k = 2, border="red")

t0 = Sys.time()
n = 2000   # n個字
tsne = di_dtm[, 1:n] %>% as.data.frame.matrix %>% scale %>% t %>% Rtsne(
check_duplicates = FALSE, theta=0.0, max_iter=3200)
Sys.time()-t0

## Time difference of 4.665123 mins

Y = tsne$Y              # tSNE coordinates
d_Y = dist(Y)           # distance matrix
hc_Y = hclust(d_Y )     # hi-clustering
plot(hc_Y,label=F)
rect.hclust(hc_Y, k=10, border="red")

kg = cutree(hc, k = 2)
L = split(di_dtm_matrix, kg)
L$`1`[1:10,1:10]

##                                                 飲料店 理科 太太 音標
## Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦                  5    0    0    0
## [新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了           0    9    8    0
## Re:[問卦]阿滴英文算過氣了嗎？                        0    0    0   12
## Re:[新聞]姪子是聖結石粉...他秒轉阿滴英文「導正       0    0    0    0
## [問卦]有沒有哥倫布屌打阿滴的八卦？                   0    0    0    0
## Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?      0    0    0    0
## [問卦]阿滴是噗是妹控啊？                             0    0    0    0
## Re:[政治]阿滴募資紐時廣告楊蕙如：YouTuber可代        0    0    0    0
## Re:[問卦]阿滴英文背景的台灣中國地圖                  0    0    0    0
## [新聞]台南人英語超強？阿滴、滴妹街頭實測結           0    0    0    0
##                                                 watch 哥倫布 聖嫂 爭鮮
## Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦                 2      0    0    0
## [新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了          0      0    0    0
## Re:[問卦]阿滴英文算過氣了嗎？                       0      0    0    0
## Re:[新聞]姪子是聖結石粉...他秒轉阿滴英文「導正      6      0    3    0
## [問卦]有沒有哥倫布屌打阿滴的八卦？                  0      4    0    0
## Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?     0      0    0    4
## [問卦]阿滴是噗是妹控啊？                            0      0    0    0
## Re:[政治]阿滴募資紐時廣告楊蕙如：YouTuber可代       0      0    0    0
## Re:[問卦]阿滴英文背景的台灣中國地圖                 0      0    0    0
## [新聞]台南人英語超強？阿滴、滴妹街頭實測結          1      0    0    0
##                                                 總統 妹妹
## Re:[爆卦]阿滴與滴妹要開珍珠奶茶店啦                0    0
## [新聞]理科太太收了滴妹喊話阿滴網讚「攻爆了         0    2
## Re:[問卦]阿滴英文算過氣了嗎？                      0    0
## Re:[新聞]姪子是聖結石粉...他秒轉阿滴英文「導正     0    0
## [問卦]有沒有哥倫布屌打阿滴的八卦？                 0    0
## Re:[問卦]館長/酷炫/阿滴/howhow在孤島生存的八卦?    0    0
## [問卦]阿滴是噗是妹控啊？                           0    8
## Re:[政治]阿滴募資紐時廣告楊蕙如：YouTuber可代      0    0
## Re:[問卦]阿滴英文背景的台灣中國地圖                0    0
## [新聞]台南人英語超強？阿滴、滴妹街頭實測結         0    0

sapply(L, function(x) x%>% colMeans %>% sort %>% tail %>% names)

##      1         2       
## [1,] "youtube" "goo"   
## [2,] "完整"    "in"    
## [3,] "集資"    "world" 
## [4,] "滴妹"    "to"    
## [5,] "美國"    "總統"  
## [6,] "代表"    "taiwan"

K = 10                            # number of clusters
g = cutree(hc_Y,K)                # cut into K clusters
table(g) %>% as.vector %>% sort   # sizes of clusters

##  [1]  82 111 117 128 132 156 179 236 302 557

wc = col_sums(di_dtm[,1:n])       # n個字
colors = distinctColorPalette(K)
png("./di.png", width=3200, height=1800)  # 輸出圖片到路徑下
textplot(Y[,1], Y[,2], colnames(di_dtm)[1:n], show=F,
col=colors[g],
cex= 0.3 + 1.25 * sqrt(wc/mean(wc)),
font=2, family = "Heiti TC Light")
dev.off()

## png 
##   2

knitr::include_graphics(file.path(ROOT.DIR, "2_Rcode/di.png"))

結論

阿滴在PTT版上的討論度在最近一個月有較高的成長，原因是因為疫情的關係讓他在四月份決定號召其他人共同集資並刊登廣告在紐約時報上為台灣發聲，因此，在四月中的網路文章討論度就很高，但是關於他的本業英文議題來說就沒有得到比較高的關注，所以其實文字分析可能還是針對某一事件發生的特殊時間點來做資料收集與分析會得到比較多有意義且好玩的結果。如果長期雖然都有評論文章的產生，但是討論度或是數量不高的話，其實也只能得到與本身特質或既定事情有關連的分析結果。

第一組專案一

社群媒體分析_期中報告

2020-04-27 23:25:56

主題：蒐集網路名人阿滴相關評論之文字分析

資料集

動機與分析目的

資料簡介

前置作業與載入相關套件使用

情緒分析

文字雲

以LIWC情緒字典分析

情緒分析(續)

繪圖出以NTUSD字典統計的文集情緒字數，觀察兩種情緒值的差異

文字雲

以NTUSD情緒字典分析

比較兩個字典分析差異

「阿滴」的討論熱度折線圖顯示

「阿滴」討論文章數的折線圖顯示

解釋折線圖意義:

文字雲

長條圖

分析詞彙和文件頻率 tf-idf

tf-idf最高的文章本文:

前後三個字彙

Word Correlation(詞彙之間的相關性)

詞彙之間相關性轉成共現圖

結論

第一組 專案一

社群媒體分析_期中報告

2020-04-27 23:25:56

主題：蒐集網路名人阿滴相關評論之文字分析

資料集

動機與分析目的

資料簡介

前置作業與載入相關套件使用

情緒分析

文字雲

以LIWC情緒字典分析

情緒分析(續)

繪圖出以NTUSD字典統計的文集情緒字數，觀察兩種情緒值的差異

文字雲

以NTUSD情緒字典分析

比較兩個字典分析差異

「阿滴」的討論熱度折線圖顯示

「阿滴」討論文章數的折線圖顯示

解釋折線圖意義:

文字雲

長條圖

分析詞彙和文件頻率 tf-idf

tf-idf最高的文章本文:

前後三個字彙

Word Correlation(詞彙之間的相關性)

詞彙之間相關性轉成共現圖

結論

第一組專案一