bitcoin
Motivation: More and more people are using machine learning and deep learning to predict the stock market and Bitcoin prices, but I have always felt that markets change too quickly for past data to be a reliable predictor of future rises and falls. Among all these variables, is there anything that stays constant? Probably "human nature," and sentiment may reflect part of it. Bitcoin dropped sharply on May 23, so I want to analyze what the social-media discussion around that time looked like and what the sentiment was.
Hypotheses:
- Was negative sentiment released before the drop?
- Is there a correlation between changes in sentiment and changes in price?
library(wordcloud2)
library(ggplot2)
library(scales)
library(rtweet)
library(dplyr)
library(xml2)
library(httr)
library(jsonlite)
library(magrittr)
library(data.tree)
library(tidytext)
library(stringr)
library(DiagrammeR)
Historical bitcoin price data

Bitcoin historical data:
- Read from the website: his_data <- read.table()
- Convert the date format: as.POSIXct(his_data$Date)
- Save as bitcoin_his_data.csv (the last two steps are sketched after the chunk below)
# Read the historical data from the web
his_data <- read.table("https://www.cryptodatadownload.com/cdd/gemini_BTCUSD_2021_1min.csv",header = T,sep = ",",skip=1 )
head(his_data)

## Unix.Timestamp Date Symbol Open High Low Close
## 1 1.625098e+12 2021-07-01 00:01:00 BTCUSD 35032.65 35060.60 35032.65 35060.60
## 2 1.625098e+12 2021-07-01 00:00:00 BTCUSD 35021.00 35137.79 35021.00 35032.65
## 3 1.625098e+12 2021-06-30 23:59:00 BTCUSD 35079.71 35079.71 35021.00 35021.00
## 4 1.625097e+12 2021-06-30 23:58:00 BTCUSD 35070.96 35090.38 35069.07 35079.71
## 5 1.625097e+12 2021-06-30 23:57:00 BTCUSD 35054.06 35076.16 35047.26 35070.96
## 6 1.625097e+12 2021-06-30 23:56:00 BTCUSD 35059.21 35064.30 35042.63 35054.06
## Volume
## 1 0.02096801
## 2 0.15234732
## 3 0.09060078
## 4 0.02939092
## 5 0.37807248
## 6 0.37451313
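The list above also mentions converting the Date column and saving a local copy; that code is not shown, so here is a minimal sketch using the column names from the preview above:

his_data$Date <- as.POSIXct(his_data$Date) # convert the date format (assumes "YYYY-MM-DD HH:MM:SS" strings)
write.csv(his_data, "bitcoin_his_data.csv", row.names = FALSE) # save a local copy as bitcoin_his_data.csv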
# "2021-05-20 13:48:32 UTC"
# "2021-05-28 08:26:10 UTC"
open_data <- his_data %>%
filter(Date >= "2021-05-20 13:48:32" & Date <= "2021-06-19 08:26:10")%>%
select(Date,Open)
head(open_data)

## Date Open
## 1 2021-06-19 08:26:00 35745.18
## 2 2021-06-19 08:25:00 35699.35
## 3 2021-06-19 08:24:00 35721.90
## 4 2021-06-19 08:23:00 35733.24
## 5 2021-06-19 08:22:00 35712.97
## 6 2021-06-19 08:21:00 35693.15
open_data %>%
ggplot(aes(Date, Open)) +
theme_minimal() +
theme(plot.title = element_text(face = "bold")) +
# geom_vline(xintercept = as.numeric(as.Date("2021-05-28")), col='red', size = 1) +
geom_line(col = '#ffa500') +
labs(title = 'Bitcoin', x = '')

Bitcoin price data
# cited from: https://www.r-bloggers.com/querying-the-bitcoin-blockchain-with-r/
# install.packages("Rbitcoin")
library(Rbitcoin)

- Download exchange rates via the public API. Bitcoin is a permissionless blockchain, so anyone can download exchange-rate data through the public API:
wait <- antiddos(market = 'kraken', antispam_interval = 5, verbose = 1)
market.api.process('kraken',c('BTC','EUR'),'ticker')

## market base quote timestamp market_timestamp last vwap
## 1: kraken BTC EUR 2021-07-01 22:10:10 <NA> 28329.1 28751.47
## volume ask bid
## 1: 3123.995 28336.4 28329.8
Bitcoin tweet data

The Twitter discussion data on bitcoin was collected through the Twitter API via rtweet, using "ETH" and "bitcoin" as keywords; only roughly the last 10 days of tweets can be retrieved this way.
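The collection code itself is not included here; a minimal sketch of how the data was presumably gathered with rtweet (the query string, n, and options are assumptions; a Twitter API token is required):

# tweet_df <- search_tweets(q = "bitcoin OR ETH", n = 18000, include_rts = FALSE, retryonratelimit = TRUE)
# save(tweet_df, file = "Bitcoindata-2.Rdata")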
Sentiment analysis

On how to save and load data sets in R: https://www.r-bloggers.com/2019/05/how-to-save-and-load-datasets-in-r-an-overview/
# load data
# save(tweet_df, file = "Bitcoindata-2.Rdata")
bitcoin_data <- get(load("Bitcoindata-2.Rdata"))
head(bitcoin_data)

## # A tibble: 6 x 90
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 88987226… 1398193870… 2021-05-28 08:26:10 JojorMaria "@CryptoPoorBo… Twitte…
## 2 14207956… 1398193720… 2021-05-28 08:25:34 Bitcoin_Win "Ethereum Pric… Revive…
## 3 14207956… 1397439017… 2021-05-26 06:26:39 Bitcoin_Win "Goldman Sachs… Revive…
## 4 14207956… 1397514509… 2021-05-26 11:26:38 Bitcoin_Win "Crypto Market… Revive…
## 5 85748594… 1398193356… 2021-05-28 08:24:08 borsa_sabir… "Ha şunu da be… Twitte…
## 6 13671238… 1398193207… 2021-05-28 08:23:32 OladipoAde1 "@AirdropStari… Twitte…
## # … with 84 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
## # reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
## # is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
## # quote_count <int>, reply_count <int>, hashtags <list>, symbols <list>,
## # urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
## # media_url <list>, media_t.co <list>, media_expanded_url <list>,
## # media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
## # ext_media_expanded_url <list>, ext_media_type <chr>,
## # mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
## # quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
## # quoted_source <chr>, quoted_favorite_count <int>,
## # quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## # quoted_name <chr>, quoted_followers_count <int>,
## # quoted_friends_count <int>, quoted_statuses_count <int>,
## # quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## # retweet_source <chr>, retweet_favorite_count <int>,
## # retweet_retweet_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## # description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## # friends_count <int>, listed_count <int>, statuses_count <int>,
## # favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## # profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>
Add each tweet's URL so it is easy to trace back to the original post later, for example to find the tweet that best represents a topic in the topic-model analysis.
BD_select <- bitcoin_data %>%
select(user_id, status_id, created_at, screen_name, text, source, favorite_count, retweet_count) %>%
mutate(Tweet_url = glue::glue("https://twitter.com/{screen_name}/status/{status_id}"))
# select(DateTime = created_at, User = screen_name, text, Tweet_url, Likes = favorite_count, RTs = retweet_count, URLs = urls_expanded_url)

Number of records and their time distribution
created_at is already a datetime column, so min and max can be used directly to find the earliest and latest timestamps. Note: rtweet can only retrieve tweets from roughly the last 10 days.
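The two UTC timestamps printed after the chunk below are presumably obtained with:

min(BD_select$created_at)
max(BD_select$created_at)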
# Count tweets per hour ("1 hours"); this could also be per minute ("1 mins") or a custom number of hours
ts_plot(BD_select, "1 hours") +
theme_minimal() +
theme(plot.title = element_text(face = "bold")) +
labs(x = NULL, y = NULL,
title = "Frequency of #bitcoin AND ETH Twitter statuses from past 8 days",
subtitle = "Twitter status (tweet) counts aggregated using an hour intervals",
caption = "\nSource: Data collected from Twitter's REST API via rtweet")## [1] "2021-05-20 13:48:32 UTC"
## [1] "2021-05-28 08:26:10 UTC"
Cleaning the tweet text
## Text-cleaning function
clean = function(txt) {
txt = iconv(txt, "latin1", "ASCII", sub="") # normalize the encoding
txt = gsub("(@|#)\\w+", "", txt) # remove @mentions and #hashtags
txt = gsub("(http|https)://.*", "", txt) # remove URLs (. is any character, * means zero or more)
txt = gsub("[ \t]{2,}", "", txt) # remove runs of two or more spaces/tabs
txt = gsub("\\n"," ",txt) # remove line breaks
txt = gsub("\\s+"," ",txt) # collapse one or more whitespace characters into a single space
txt = gsub("^\\s+|\\s+$","",txt) # trim leading and trailing whitespace
txt = gsub("&.*;","",txt) # remove HTML entities
txt = gsub("[^a-zA-Z0-9?!. ']","",txt) # keep only letters, digits, spaces and ? ! . ' (drops emoji)
txt }
BD_select$text = clean(BD_select$text) # apply the cleaning to the text column
df = data.frame()
BD_df = rbind(df,BD_select) # transfer to data frame
BD_df = BD_df[!duplicated(BD_df[,"status_id"]),] # drop duplicated tweets

The data frame has 90 columns, but only a few of them are used here:
- user_id: user ID
- status_id: tweet ID
- created_at: time the tweet was posted
- screen_name: screen name
- text: tweet text
- source: client the tweet was posted from
- favorite_count: number of likes
- retweet_count: number of retweets
- Tweet_url: URL of the original tweet
## # A tibble: 6 x 9
## user_id status_id created_at screen_name text source favorite_count
## <chr> <chr> <dttm> <chr> <chr> <chr> <int>
## 1 8898722… 13981938… 2021-05-28 08:26:10 JojorMaria Come… Twitt… 0
## 2 1420795… 13981937… 2021-05-28 08:25:34 Bitcoin_Win Ethe… Reviv… 0
## 3 1420795… 13974390… 2021-05-26 06:26:39 Bitcoin_Win Gold… Reviv… 1
## 4 1420795… 13975145… 2021-05-26 11:26:38 Bitcoin_Win Cryp… Reviv… 0
## 5 8574859… 13981933… 2021-05-28 08:24:08 borsa_sabi… Ha u… Twitt… 0
## 6 1367123… 13981932… 2021-05-28 08:23:32 OladipoAde1 Awes… Twitt… 1
## # … with 2 more variables: retweet_count <int>, Tweet_url <glue>
Splitting out the date, hour, and minute
BD_date <- BD_df%>%
select(status_id,created_at,source)
BD_date$Dates <- format(as.Date(BD_date$created_at,"%Y-%m-%d"), format = "%Y-%m-%d") # for POSIXct input the second argument of as.Date() is a timezone, not a format, hence the warning below

## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d'
BD_date$Hours <- format(as.POSIXct(BD_date$created_at, "%Y-%m-%d %H:%M:%S", tz = ""), format = "%H")
BD_date$Minutes <- format(as.POSIXct(BD_date$created_at, "%Y-%m-%d %H:%M:%S", tz = ""), format = "%M")
head(BD_date)

## # A tibble: 6 x 6
## status_id created_at source Dates Hours Minutes
## <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 1398193870828048… 2021-05-28 08:26:10 Twitter for Andr… 2021-05… 08 26
## 2 1398193720130891… 2021-05-28 08:25:34 Revive Social App 2021-05… 08 25
## 3 1397439017344331… 2021-05-26 06:26:39 Revive Social App 2021-05… 06 26
## 4 1397514509137301… 2021-05-26 11:26:38 Revive Social App 2021-05… 11 26
## 5 1398193356619014… 2021-05-28 08:24:08 Twitter Web App 2021-05… 08 24
## 6 1398193207457062… 2021-05-28 08:23:32 Twitter for Andr… 2021-05… 08 23
Connecting to CoreNLP

(1). API call setup: starting the server
- First start the CoreNLP server (stanford-corenlp-4.2.0) from a terminal.
- Open a terminal in the CoreNLP directory and run: java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
# Build the coreNLP API URL: turn the local address into a URL the coreNLP service understands
generate_API_url <- function(host, port="9000",
tokenize.whitespace="false", annotators=""){ # do not split tokens on whitespace only
url <- sprintf('http://%s:%s/?properties={"tokenize.whitespace":"%s","annotators":"%s"}', host, port, tokenize.whitespace, annotators)
url <- URLencode(url) # percent-encode the URL (e.g. "%e8%98%8b%e6")
}
# specify where the service is running
host = "127.0.0.1"
generate_API_url(host)

# Call the coreNLP API
# Annotators used: annotators=c("tokenize","ssplit","pos","lemma","ner","parse","sentiment")
call_coreNLP <- function(server_host, text, host="localhost", language="eng",
tokenize.whitespace="true", ssplit.eolonly="true", annotators=c("tokenize","ssplit","pos","lemma","ner","parse","sentiment")){
# Assume two core-nlp servers: one for English (port 9000) and one for Chinese (port 9001)
port <- ifelse(language=="eng", 9000, 9001);
# build the API URL
url <- generate_API_url(server_host, port=port,
tokenize.whitespace=tokenize.whitespace, annotators=paste0(annotators, collapse = ','))
result <- POST(url, body = text, encode = "json")
doc <- httr::content(result, "parsed","application/json",encoding = "UTF-8")
return (doc)
}
# Run each document through the coreNLP service
coreNLP <- function(data,host){
# send each document to core-nlp in turn; each document comes back as JSON
# the results are stored as R objects
result <- apply(data, 1 , function(x){
object <- call_coreNLP(host, x['text'])
list(doc=object, data=x)
})
return(result)
}

Functions for parsing the extracted data
- coreNLP_tokens_parser: parses the data returned by coreNLP and extracts the tokenization results
- do.call: do.call(what, args, quote = FALSE, envir = parent.frame()) calls a function with the elements of a list as its arguments
- do.call + rbind: binds a list of data frames into a single data frame (a tiny illustration follows this list)
- coreNLP_dependency_parser: extracts the dependency-parse results
- coreNLP_sentiment_parser: extracts the sentence-sentiment results
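A tiny illustration of the do.call + rbind pattern on made-up data:

dfs <- list(data.frame(x = 1), data.frame(x = 2), data.frame(x = 3))
do.call(rbind, dfs) # equivalent to rbind(dfs[[1]], dfs[[2]], dfs[[3]]): one 3-row data frame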
# Tidy the tokenization results out of the returned object, output in tidy-data format
# There are many more fields available; they can be added to this function later if needed: coreNLP_tokens_parser
coreNLP_tokens_parser <- function(coreNLP_objects){
result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
original_data <- obj$data
doc <- obj$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)!=0) # newly added guard against empty results
sen <- sentences[[1]]
# extract the tokenization results: tokens
tokens <- do.call(rbind, lapply(sen$tokens, function(x){
result <- data.frame(word=x$word, lemma=x$lemma, pos=x$pos, ner=x$ner)
result
}))
tokens <- original_data %>%
t() %>%
data.frame() %>%
select(-text) %>%
slice(rep(1:n(), each = nrow(tokens))) %>%
bind_cols(tokens)
tokens
}))
return(result)
}
# Tidy the word dependency relations out of the returned core-nlp object, output in tidy-data format
coreNLP_dependency_parser <- function(coreNLP_objects){
result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
original_data <- obj$data
doc <- obj$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)!=0) # newly added guard against empty results
sen <- sentences[[1]]
# extract the dependency relations: Dependencies
dependencies <- do.call(rbind, lapply(sen$basicDependencies, function(x){
result <- data.frame(dep=x$dep, governor=x$governor, governorGloss=x$governorGloss, dependent=x$dependent, dependentGloss=x$dependentGloss)
result
}))
dependencies <- original_data %>%
t() %>%
data.frame() %>%
select(-text) %>%
slice(rep(1:n(), each = nrow(dependencies))) %>%
bind_cols(dependencies)
dependencies
}))
return(result)
}
# Tidy the sentence sentiment out of the returned core-nlp object, output in tidy-data format
coreNLP_sentiment_parser <- function(coreNLP_objects){
result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
original_data <- obj$data
doc <- obj$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)!=0) # newly added guard against empty results
sen <- sentences[[1]]
# extract the sentiment: sentiment
sentiment <- original_data %>%
t() %>%
data.frame() %>%
bind_cols(data.frame(sentiment=sen$sentiment, sentimentValue=sen$sentimentValue))
sentiment
}))
return(result)
}

Visualization function
- parse2tree: visualizes the parse result as a tree diagram
# Display the parse/dependency result graphically
parse2tree <- function(ptext) {
stopifnot(require(NLP) && require(igraph))
# this step modifies coreNLP parse tree to mimic openNLP parse tree
ptext <- gsub("[\r\n]", "", ptext)
ptext <- gsub("ROOT", "TOP", ptext)
## Replace words with unique versions
ms <- gregexpr("[^() ]+", ptext) # just ignoring spaces and brackets?
words <- regmatches(ptext, ms)[[1]] # just words
regmatches(ptext, ms) <- list(paste0(words, seq.int(length(words)))) # add id to words
## Going to construct an edgelist and pass that to igraph
## allocate here since we know the size (number of nodes - 1) and -1 more to exclude 'TOP'
edgelist <- matrix('', nrow=length(words)-2, ncol=2)
## Function to fill in edgelist in place
edgemaker <- (function() {
i <- 0 # row counter
g <- function(node) { # the recursive function
if (inherits(node, "Tree")) { # only recurse subtrees
if ((val <- node$value) != 'TOP1') { # skip 'TOP' node (added '1' above)
for (child in node$children) {
childval <- if(inherits(child, "Tree")) child$value else child
i <<- i+1
edgelist[i,1:2] <<- c(val, childval)
}
}
invisible(lapply(node$children, g))
}
}
})()
## Create the edgelist from the parse tree
edgemaker(Tree_parse(ptext))
tree <- FromDataFrameNetwork(as.data.frame(edgelist))
return (tree)
}

## # A tibble: 6 x 2
## status_id text
## <chr> <chr>
## 1 1398193870828048… Come here beb
## 2 1398193720130891… Ethereum Price Prediction ETH bids goodbye to 3000 as bulli…
## 3 1397439017344331… Goldman Sachs Ethereum ETH Might Overtake Bitcoin BTC As A …
## 4 1397514509137301… Crypto Market Tanks 14 to 3Month Low Under 1.35 Trillion Et…
## 5 1398193356619014… Ha unu da belirteyimveyaay oldu mu diyenler var. Ay piyasas…
## 6 1398193207457062… Awesome project
Running CoreNLP

Get the objects returned by coreNLP. With a lot of data this takes a long time, so it is best to run it once and save the result.
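The run itself is not shown in this document; a minimal sketch, assuming the server from the previous section is listening at host and that the returned objects are stored in aa (the .Rdata file name here is hypothetical):

# aa <- coreNLP(BD_df, host) # slow for ~30,000 tweets, so run it only once
# save(aa, file = "coreNLP_result.Rdata") # hypothetical file name
# load("coreNLP_result.Rdata")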
Extracting the results

(1). Tokenization, lemmatization, POS tagging, NER
Load the previously saved coreNLP results.
Remove the empty entries (they cannot be parsed by the functions defined above).
Run a small test first:
# Debug
# finally removed the entries where length(sentences)==0
n=c()
for(i in 1:length(test)){
original_data <- test[[i]]$data
doc <- test[[i]]$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)==0) next # next skips the iteration; inside the parser functions a simple if-check is enough
# sen <- sentences[[1]]
print(sprintf('number is %i', i))
}

## [1] "number is 1"
## [1] "number is 2"
## [1] "number is 3"
## [1] "number is 4"
## [1] "number is 5"
## [1] "number is 6"
## [1] "number is 7"
## [1] "number is 8"
## [1] "number is 9"
## [1] "number is 10"
## [1] "number is 12"
## [1] "number is 13"
## [1] "number is 14"
## [1] "number is 15"
## [1] "number is 16"
## [1] "number is 17"
## [1] "number is 18"
## [1] "number is 19"
## [1] "number is 20"
## [1] "number is 21"
# entry 11, which has no value, was skipped, as intended
# [1] "number is 1"
# [1] "number is 2"
# [1] "number is 3"
# [1] "number is 4"
# [1] "number is 5"
# [1] "number is 6"
# [1] "number is 7"
# [1] "number is 8"
# [1] "number is 9"
# [1] "number is 10"
# [1] "number is 12"把test中的sentens是空值的rows刪掉了 长度从21变为了20,我们成功将空值删除了
## [1] 21
n=c()
for(i in 1:length(test)){
original_data <- test[[i]]$data
doc <- test[[i]]$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)==0)
n=c(n,i) # collect the indices of the empty entries
next
}
test=test[-n]
length(test)

## [1] 20
## status_id word lemma pos ner
## 1 1397129949706915850 So so RB O
## 2 1397129949706915850 who who WP O
## 3 1397129949706915850 is be VBZ O
## 4 1397129949706915850 ready ready JJ O
## 5 1397129949706915850 to to TO O
## 6 1397129949706915850 make make VB O
# now it is fully cleaned up
# find the indices of every "y", then delete them (the same idea is used below to delete all empty entries)
ddata = list("z","dd","y","ff","s","5","y","u")
n=c()
for(i in 1:length(ddata)){
if(ddata[[i]] == "y")
n=c(n,i)
next
}
n

## [1] 3 7
## [[1]]
## [1] "z"
##
## [[2]]
## [1] "dd"
##
## [[3]]
## [1] "ff"
##
## [[4]]
## [1] "s"
##
## [[5]]
## [1] "5"
##
## [[6]]
## [1] "u"
Do the same thing for the full data set aa; its length before removal is:
## [1] 30770
n=c()
for(i in 1:length(aa)){
original_data <- aa[[i]]$data
doc <- aa[[i]]$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)==0)
n=c(n,i) # this is the R equivalent of append in Python
next
}
aa=aa[-n]
length(aa)

## [1] 30697
coreNLP_tokens_parser columns:
- status_id: the status_id from the original df, the unique ID of a tweet
- word: the original token
- lemma: the lemma (base form) of the token
- pos: part-of-speech tag
- ner: named-entity tag
# This step also takes a while, so the tokens were saved to BDtokens.csv
# tokens = coreNLP_tokens_parser(aa)
# head(tokens,20)
# write.csv(tokens,"BDtokens.csv", row.names = FALSE)

Read the previously parsed tokens back in directly.
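Presumably something like the following, which would produce the preview below (read.csv turns status_id into a numeric, which is why it prints as 1.398194e+18):

tokens <- read.csv("BDtokens.csv")
head(tokens, 20)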
## status_id word lemma pos ner
## 1 1.398194e+18 Come come VB O
## 2 1.398194e+18 here here RB O
## 3 1.398194e+18 beb beb NNP O
## 4 1.398194e+18 Ethereum Ethereum NNP O
## 5 1.398194e+18 Price Price NNP O
## 6 1.398194e+18 Prediction Prediction NNP O
## 7 1.398194e+18 ETH ETH NNP O
## 8 1.398194e+18 bids bid NNS O
## 9 1.398194e+18 goodbye goodbye NN O
## 10 1.398194e+18 to to TO O
## 11 1.398194e+18 3000 3000 CD NUMBER
## 12 1.398194e+18 as as IN O
## 13 1.398194e+18 bullish bullish JJ O
## 14 1.398194e+18 momentum momentum NN O
## 15 1.398194e+18 fades fade VBZ O
## 16 1.398194e+18 Ethereum Ethereum NNP O
## 17 1.398194e+18 gracefully gracefully RB O
## 18 1.398194e+18 rose rise VBD O
## 19 1.398194e+18 from from IN O
## 20 1.398194e+18 the the DT O
NER analysis

With named-entity recognition we can see which countries, organizations, and so on appear in the discussion.
## [1] O NUMBER DATE ORGANIZATION
## [5] SET MISC NATIONALITY CAUSE_OF_DEATH
## [9] STATE_OR_PROVINCE TITLE LOCATION PERSON
## [13] CITY DURATION MONEY COUNTRY
## [17] ORDINAL CRIMINAL_CHARGE IDEOLOGY RELIGION
## [21] TIME PERCENT URL
## 23 Levels: CAUSE_OF_DEATH CITY COUNTRY CRIMINAL_CHARGE DATE ... URL
## [1] 16140
Lowercasing

Capitalization also affects CoreNLP's NER decisions, so the tweet text was fed in without case normalization. After running the annotators, we create two new columns, lower_word and lower_lemma, holding the lowercased word and lemma, so that different capitalizations of the same word (e.g. Evergiven vs. evergiven) are counted together when computing word frequencies.
tokens$lower_word = tolower(tokens$word)
tokens$lower_lemma = tolower(tokens$lemma)
unique(tokens$ner)

## [1] O NUMBER DATE ORGANIZATION
## [5] SET MISC NATIONALITY CAUSE_OF_DEATH
## [9] STATE_OR_PROVINCE TITLE LOCATION PERSON
## [13] CITY DURATION MONEY COUNTRY
## [17] ORDINAL CRIMINAL_CHARGE IDEOLOGY RELIGION
## [21] TIME PERCENT URL
## 23 Levels: CAUSE_OF_DEATH CITY COUNTRY CRIMINAL_CHARGE DATE ... URL
Country analysis

The most-discussed countries are still the United States and China.
tokens %>%
filter(ner == "COUNTRY") %>% # keep tokens whose NER tag is COUNTRY
group_by(lower_word) %>% # group by word
summarize(count = n()) %>% # count each group
top_n(n = 13, count) %>%
ungroup() %>%
mutate(word = reorder(lower_word, count)) %>%
ggplot(aes(word, count)) +
geom_col()+
ggtitle("Word Frequency (NER is COUNTRY)") +
theme(text=element_text(size=14))+
coord_flip()

Organizations mentioned (ORGANIZATION)
Cardano (ADA) is a public blockchain platform. Bitcoin (BTC): inflation hedge, store of value, digital gold. Ether (ETH): the infrastructure of the crypto world, its water, electricity, and gas. Exchange tokens such as Binance Coin (BNB): comparable to holding exchange stock. Stablecoins such as USDT: payments.
tokens %>%
filter(ner == "ORGANIZATION") %>% # keep tokens whose NER tag is ORGANIZATION
group_by(lower_word) %>% # group by word
summarize(count = n()) %>% # count each group
top_n(n = 10, count) %>%
ungroup() %>%
mutate(word = reorder(lower_word, count)) %>%
ggplot(aes(word, count)) +
geom_col()+
ggtitle("Word Frequency (NER is ORGANIZATION)") +
theme(text=element_text(size=14))+
coord_flip()

People mentioned (PERSON)
Charles Hoskinson is a co-founder of BitShares and a co-founder of Cardano and Ethereum, both of which are blockchain platforms. Polygon (MATIC) is an Ethereum layer-2 scaling solution.
tokens %>%
filter(ner == "PERSON") %>% # keep tokens whose NER tag is PERSON
group_by(lower_word) %>% # group by word
summarize(count = n()) %>% # count each group
top_n(n = 10, count) %>%
ungroup() %>%
mutate(word = reorder(lower_word, count)) %>%
ggplot(aes(word, count)) +
geom_col()+
ggtitle("Word Frequency (NER is PERSON)") +
theme(text=element_text(size=14))+
coord_flip()

Sentiment scores
Examining the sentiment of the tweets
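The table below was presumably produced with coreNLP_sentiment_parser; a minimal sketch, assuming aa holds the coreNLP objects and keeping only the id, text, and sentiment columns for display:

# sentiment = coreNLP_sentiment_parser(aa) # slow, so run once and reuse the result
head(sentiment[, c("status_id", "text", "sentiment", "sentimentValue")], 20)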
## status_id
## 1 1398193870828048387
## 2 1398193720130891776
## 3 1397439017344331784
## 4 1397514509137301508
## 5 1398193356619014145
## 6 1398193207457062913
## 7 1396902299117752325
## 8 1397935607356272646
## 9 1398193017706717185
## 10 1397978301428289547
## 11 1397828282368811008
## 12 1398183614198714368
## 13 1398187296713842691
## 14 1397943404424794117
## 15 1397833729914163200
## 16 1397949637932236800
## 17 1397869800244850690
## 18 1397754842190270464
## 19 1397671741309128706
## 20 1396988433831055362
## text
## 1 Come here beb
## 2 Ethereum Price Prediction ETH bids goodbye to 3000 as bullish momentum fades Ethereum gracefully rose from the support formed at 1750 to trade near 2800. The recovery was not unique to the
## 3 Goldman Sachs Ethereum ETH Might Overtake Bitcoin BTC As A Store of Value Amid the recent crypto market volatility and the rising dominance of Ethereum ETH the debate of ETH vs BTC has
## 4 Crypto Market Tanks 14 to 3Month Low Under 1.35 Trillion Ethereum ETH Under 2000 On Sunday May 23 the cryptocurrency market has extended its weekly losses further tanking another 14 and
## 5 Ha unu da belirteyimveyaay oldu mu diyenler var. Ay piyasasn aha ayya girdik demekle olmaz ki. Grafikte ay piyasasna girdin diyen indikatr veya teknik yok. nemli olan u birok alt coin. Ocak ubat fiyatna geldi. Bitcoin de ocak fiyatnda
## 6 Awesome project
## 7 EK olarak EKL ile belirlenecek 5 kiiye 2er adet CAKE hediye edeceiz..
## 8 trx Bitcoin ve ETH baklrsa fiyat yl geneline gre halen daha uygun saylr olas bir yn hareketi belirterek destek ve direnleri izdim. Terste kalanlarn sabrla beklemesi gerekir en azndan salkl bir projede yatrmnz var.
## 9 xrp dn direnleri bugn destekleri konualm en yakn destek 84 seviyesi 2. destek 75 seviyesi.. Bugndzeltmesinin uzun soluklu olacan dnmyorum xrp ve eth nin 23 haftas daha var diye tahmin ediyorum..
## 10 Bakn bence Elon Musk 1 ise Bu adam kesinlikle maniplasyon konusunda 10'dur. Bana att tweetler hi bir ekilde samimi gelmiyor.. Bu adam HODL diyorsa ben endie duyuyorum gerekten.. En byk maniplasyonu bunlar yapyor
## 11 Piyasann yeilleri yakmas an meselesincesi minik bir dzeltme ardndan hafta sonu art depremleri ve nmzde gl bir ykseli dalgas nceki krlm noktalarndan coinler toplanabilir.. Analizler iin..
## 12 Hafta sonuna girerken olabilecek dzeltmeler den korkmamal ama temkinli olmak lazm stop kullanmay ihmal etmeyin. Ayrca aa dip grdnz destek noktalarna alm oltalar girebilirsiniz.
## 13 Bugn tekrar herkes elinde terste kalan coinleri yazsn tek tek bakalm gn ierisinde grafikler ne durumdaizin verdii srece grafikler zerinde yorum yaparz..
## 14 Hangi coin veya coinlerden terste kaldnz? Bu yl en ok hangi coinden kazandnz? Kar damsnz? Zarar damsnz? Yakn bir sre ierisinde geri dnm iin gzel bir alma balatacaz lk ders kaybetmemek.
## 15 crv Dn sorulan coinlerden biri alan kiiler maliyetlerini belirtebilirler yorumu ona gre yapabilirim. ATH seviyesinden ok geride TOBO oluumu bir ok coinde gzlemlenebilir nmzdeki hafta. Daha fazla Analiz iin..
## 16 pnt ilk ekleme noktas 0.87 seviyesi kademeli bir ekilde alma uygun cake eer ykseliini srdrmeye devam ederse pnt fiyat ok ucuz kalacak.. kademeli hedefler belirtilmitir ''Pump coini'' olduu aikar.
## 17 Hangi hatay yaptn tam olarak zemiyorsan ayn ekilde piyasa ile alakal i skntn telegram grubumuzda da dile getirebilirsiniz. Her eyden nce Psikolojinizi korumak zorundasnz.
## 18 Buy BTC With Your Credit Card Fast Crypto Exchange
## 19 Buy BTC With Your Credit Card Fast Crypto Exchange
## 20 Buy Crypto With Your Credit Card Fast Crypto Exchange
## sentiment sentimentValue
## 1 Neutral 2
## 2 Positive 3
## 3 Positive 3
## 4 Negative 1
## 5 Positive 3
## 6 Verypositive 4
## 7 Neutral 2
## 8 Neutral 2
## 9 Positive 3
## 10 Neutral 2
## 11 Neutral 2
## 12 Neutral 2
## 13 Neutral 2
## 14 Neutral 2
## 15 Neutral 2
## 16 Neutral 2
## 17 Neutral 2
## 18 Neutral 2
## 19 Neutral 2
## 20 Neutral 2
## [1] Neutral Positive Negative Verypositive Verynegative
## Levels: Neutral Positive Negative Verypositive Verynegative
sentiment$sentimentValue <- sentiment$sentimentValue %>%
as.numeric
# look at the distribution of sentiment labels
sentiment$sentiment %>%
table()

## .
## Neutral Positive Negative Verypositive Verynegative
## 20875 5497 4119 195 11
# this is the only place where the time format is actually converted
BD_df$date = as.Date(BD_df$created_at)
sentiment %>%
merge(BD_df[,c("status_id","source","date")]) %>%
group_by(date) %>%
summarise(avg_sentiment = mean(sentimentValue,na.rm=T)) %>%
ggplot(aes(x=date,y=avg_sentiment)) +
geom_line()

# the time here is still only year-month-day; I want minute-level data
# here the minute-level data is selected
senti_df <- sentiment%>%
merge(BD_date)
senti_df$created_at = as.Date(senti_df$created_at)
head(senti_df)

## status_id
## 1 1395375895532146699
## 2 1395376077246238724
## 3 1395376411603443720
## 4 1395376681573965825
## 5 1395376684514357251
## 6 1395376687014109185
## text
## 1 If Shibu ever challenges Doge in the future it may be the battle of the century between cryptos.Crypto semantics make this a plausible reality in the near future.
## 2 hbar btc eth xrp vet ada doge
## 3 What Is Internet Computer Protocol ICP?
## 4 Buy BTC With Your Credit Card Fast Crypto Exchange
## 5 These ExJournalists Are Using AI to Catch Online Defamation
## 6 The Long Journey of Usoni an African Postapocalyptic Game
## sentiment sentimentValue created_at source Dates Hours
## 1 Negative 3 2021-05-20 Twitter for iPhone 2021-05-20 13
## 2 Neutral 1 2021-05-20 Twitter Web App 2021-05-20 13
## 3 Neutral 1 2021-05-20 Twitter Web App 2021-05-20 13
## 4 Neutral 1 2021-05-20 CoinTweety 2021-05-20 13
## 5 Neutral 1 2021-05-20 IFTTT 2021-05-20 13
## 6 Neutral 1 2021-05-20 IFTTT 2021-05-20 13
## Minutes
## 1 48
## 2 49
## 3 50
## 4 51
## 5 51
## 6 51
First look at how sentiment changes over time (it is a bit too dense).
Because the minute level is too fine-grained, we switch to hourly aggregation.
hhhs <- senti_df %>%
group_by(Dates, Hours) %>%
dplyr::summarise(avg_sentiment = mean(sentimentValue,na.rm=T))

## `summarise()` has grouped output by 'Dates'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: Dates [1]
## Dates Hours avg_sentiment
## <chr> <chr> <dbl>
## 1 2021-05-20 13 1.51
## 2 2021-05-20 14 1.54
## 3 2021-05-20 15 1.55
## 4 2021-05-20 16 1.53
## 5 2021-05-20 17 1.68
## 6 2021-05-20 18 1.53
# combine the date and hour into one time string to make plotting easier
hhhs$time <- paste(hhhs$Dates, hhhs$Hours)
# hhhs$time <- as.POSIXct(paste(hhhs$time, hhhs$Minutes), format="%Y-%m-%d %H:%M")
# hhhs$time <- as.POSIXct(hhhs$time, format="%Y-%m-%d %H:%M:%S")
hhhs$times <- as.POSIXct(hhhs$time,format="%Y-%m-%d %H")
head(hhhs)

## # A tibble: 6 x 5
## # Groups: Dates [1]
## Dates Hours avg_sentiment time times
## <chr> <chr> <dbl> <chr> <dttm>
## 1 2021-05-20 13 1.51 2021-05-20 13 2021-05-20 13:00:00
## 2 2021-05-20 14 1.54 2021-05-20 14 2021-05-20 14:00:00
## 3 2021-05-20 15 1.55 2021-05-20 15 2021-05-20 15:00:00
## 4 2021-05-20 16 1.53 2021-05-20 16 2021-05-20 16:00:00
## 5 2021-05-20 17 1.68 2021-05-20 17 2021-05-20 17:00:00
## 6 2021-05-20 18 1.53 2021-05-20 18 2021-05-20 18:00:00
Sentiment scores range from 0 (lowest) to 4 (highest):
- 0, 1: very negative, negative
- 2: neutral
- 3, 4: positive, very positive
# merge the bitcoin price with the hourly sentiment into one table (both are standardized below so they can be plotted together)
mergedata <- open_data %>%
merge(hhhs, by.x="Date", by.y="times")
# ggplot() +
# theme_minimal() +
# theme(plot.title = element_text(face = "bold")) +
# geom_line(aes(x=Date,y=avg_sentiment),col = '#ffa500') +
mergedata$scaleOpen <- scale(mergedata$Open,center = T,scale = T)
mergedata$scaleAS <- scale(mergedata$avg_sentiment,center = T,scale = T)
head(mergedata)

## Date Open Dates Hours avg_sentiment time
## 1 2021-05-20 14:00:00 41835.51 2021-05-20 14 1.538462 2021-05-20 14
## 2 2021-05-20 15:00:00 41547.04 2021-05-20 15 1.551724 2021-05-20 15
## 3 2021-05-20 16:00:00 41526.04 2021-05-20 16 1.534188 2021-05-20 16
## 4 2021-05-20 17:00:00 39383.56 2021-05-20 17 1.677273 2021-05-20 17
## 5 2021-05-20 18:00:00 39893.33 2021-05-20 18 1.527950 2021-05-20 18
## 6 2021-05-20 19:00:00 39920.72 2021-05-20 19 1.439759 2021-05-20 19
## scaleOpen scaleAS
## 1 1.9934591 0.6232510
## 2 1.8431808 0.7405717
## 3 1.8322409 0.5854476
## 4 0.7161177 1.8511723
## 5 0.9816819 0.5302689
## 6 0.9959507 -0.2498696
ggplot(data = mergedata)+
geom_line(aes(x=Date,y=scaleOpen))+
geom_line(aes(x=Date,y=scaleAS),linetype="dotted")
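The second hypothesis asks whether changes in sentiment and changes in price are correlated. The document stops at the visual comparison above; a quick numeric check on the merged hourly table could look like this (a sketch, not part of the original analysis):

cor(mergedata$Open, mergedata$avg_sentiment, use = "complete.obs") # Pearson correlation between hourly open price and average sentiment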
Topic analysis of the discussion

LDA topic modeling
- Based on word frequency, keep only the words that appear more than 3 times
- After arranging the data into an id / word / count format, it can be cast to a DTM
P.S. After the group_by the original word structure is lost, so the word frequencies are stored separately in reserved_word.
## status_id word lemma pos ner lower_word lower_lemma
## 1 1.398194e+18 Come come VB O come come
## 2 1.398194e+18 here here RB O here here
## 3 1.398194e+18 beb beb NNP O beb beb
## 4 1.398194e+18 Ethereum Ethereum NNP O ethereum ethereum
## 5 1.398194e+18 Price Price NNP O price price
## 6 1.398194e+18 Prediction Prediction NNP O prediction prediction
# stop_words <- scan(file = "./dict/stopwords.txt", what=character(),sep='\n',
# encoding='utf-8',fileEncoding='utf-8')
# head(stop_words)
# the default stop words bundled with tidytext can be loaded directly
data("stop_words")
head(stop_words)

## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
# reshape the tokens into the format we need
new_tokens <- tokens %>%
select(status_id, lower_word) %>%
anti_join(stop_words, by = c("lower_word" = "word")) %>% # drop stop words
filter((!str_detect(lower_word, regex("[0-9]")))) %>%
count(status_id, lower_word) %>%
rename(count=n)
head(new_tokens)

## status_id lower_word count
## 1 1.395376e+18 battle 1
## 2 1.395376e+18 century 1
## 3 1.395376e+18 challenges 1
## 4 1.395376e+18 cryptos.crypto 1
## 5 1.395376e+18 doge 1
## 6 1.395376e+18 future 1
library(tm)
library(topicmodels)
library(purrr)
library(reshape2)
library(tidyr)
# Build graph data
library(tidygraph)
library(igraph)
library(ggraph)

Converting the data to a Document-Term Matrix (DTM)
freq = 3
# keep words whose frequency exceeds freq
reserved_word <- new_tokens %>%
group_by(lower_word) %>%
count() %>%
filter(n > freq) %>%
unlist()
mask_removed <- new_tokens %>%
filter(lower_word %in% reserved_word)
# in mask_dtm, nrow = number of documents; ncol = number of terms
mask_dtm <- mask_removed %>%
cast_dtm(status_id, lower_word, count)
inspect(mask_dtm[1:10,1:10])

## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 17/83
## Sparsity : 83%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs ada battle btc doge eth future future. hbar reality vet
## 1395375895532146688 0 1 0 1 0 1 1 0 1 0
## 1395376077246238720 1 0 1 1 1 0 0 1 0 1
## 1395376411603443712 0 0 0 0 0 0 0 0 0 0
## 1395376681573965824 0 0 1 0 0 0 0 0 0 0
## 1395376684514357248 0 0 0 0 0 0 0 0 0 0
## 1395376687014109184 0 0 0 0 0 0 0 0 0 0
## 1395376690839408640 0 0 0 0 0 0 0 0 0 0
## 1395376809823412224 0 0 1 0 0 0 0 0 0 0
## 1395376818476171264 0 0 0 0 0 0 0 0 0 0
## 1395376894783205376 1 0 1 1 1 0 0 0 0 0
Building the LDA model
# drop the rows of the matrix that are all zeros
raw.sum <- apply(mask_dtm,1,FUN=sum) # sum each row of the matrix
mask_dtm <- mask_dtm[raw.sum!=0,]
# build the model
# mask_lda <- LDA(mask_dtm, k = 4, control = list(seed = 123))

Finding the best number of topics via perplexity
ldas = c()
topics = c(2,4,6,8)
for(topic in topics){
start_time <- Sys.time()
lda <- LDA(mask_dtm, k = topic, control = list(seed = 2021))
ldas <- c(ldas,lda)
print(paste(topic ,paste("topic(s) and use time is ", Sys.time() -start_time)))
save(ldas,file = "ldas_result.rdata") # save the fitted models to a file
}

## [1] "2 topic(s) and use time is 0.930460929870605"
## [1] "4 topic(s) and use time is 24.2873389720917"
## [1] "6 topic(s) and use time is 45.7744100093842"
## [1] "8 topic(s) and use time is 56.1082332134247"
topics = c(2,4,6,8)
data_frame(k = topics, perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
ggplot(aes(k, perplex)) +
geom_point() +
geom_line() +
labs(title = "Evaluating LDA topic models",
subtitle = "Optimal number of topics (smaller is better)",
x = "Number of topics",
y = "Perplexity")## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
Based on the plot above, 4 topics are chosen.
Extracting the representative terms
# build the model
mask_lda <- LDA(mask_dtm, k = 4, control = list(seed = 123))
removed_word = c("bitcoin","eth")
# look at the common words in each topic
tidy(mask_lda, matrix = "beta") %>% # extract the topic-term beta values
filter(! term %in% removed_word) %>%
group_by(topic) %>%
top_n(10, beta) %>% # top 10 words by beta in each topic
ungroup() %>%
mutate(topic = as.factor(topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()

topic 1 = " " topic 2 = " " topic 3 = " " topic 4 = " "
Exploring the content of each topic

# use matrix="gamma" in the tidy function to get the theta (document-topic) matrix
mask_topics <- tidy(mask_lda, matrix="gamma") %>% # document topic gamma
group_by(document) %>%
top_n(1, wt=gamma)
mask_topics

## # A tibble: 29,745 x 3
## # Groups: document [29,689]
## document topic gamma
## <chr> <int> <dbl>
## 1 1395376809823412224 1 0.875
## 2 1395377181518401536 1 0.820
## 3 1395377590454702080 1 0.587
## 4 1395377780527796224 1 0.985
## 5 1395378767145676800 1 0.977
## 6 1395378816768434176 1 0.972
## 7 1395379083815489536 1 0.985
## 8 1395381293991464960 1 0.844
## 9 1395382161499475968 1 0.898
## 10 1395382987567554560 1 0.989
## # … with 29,735 more rows
posts_topic <- merge(x = BD_df, y = mask_topics, by.x = "status_id", by.y="document")
# see what each topic is talking about
set.seed(123)
posts_topic %>% # topic 2
filter(topic==2) %>%
select(text) %>%
unique() %>%
sample_n(5)

## text
## 1 We fly to mars!!!eth With infinite burninggains value every dayeth in actioneth
## 2 Come learnonwith aor! How to Buy Cryptocurrency on Bitrue With Debit or Credit Card
## 3 Speklatif hareketlerparalarn gelecein teknolojisi olmasn vein gvenilirliini asla deitirmeyecektir.gelecein teknolojisinde ba ekmektedir.nun da dedii gibi ''.
## 4 Old Masters Art Auguste Renoir Georges Rivire 1877. Buy at
## 5 Grnm alalan kama 4h periyotta RSI da pozitif uyumsuzluk var ve 384 e kadar ykseli yapabilir.nemli olan Kama nn dna trendi krp kabilmesidir. 384 de kp kramazsa kama desteine yeniden iner destekdeki izlenir bnb
Topic distribution by date
topics_name = c("1","2","3","4")
posts_topic %>%
mutate(artDate = as.Date(created_at)) %>%
group_by(artDate,topic) %>%
summarise(sum =sum(topic)) %>%
ggplot(aes(x= artDate,y=sum,fill=as.factor(topic))) +
geom_col()

## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
posts_topic %>%
mutate(artDate = as.Date(created_at)) %>%
group_by(artDate,topic) %>%
summarise(sum =sum(topic)) %>%
ggplot(aes(x= artDate,y=sum,fill=as.factor(topic))) +
geom_col(position="fill") ## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
Removing the ad posts and re-running the analysis

A large share of the tweets are ads, and we want to filter them out. Normal users rarely tweet repeatedly, so we drop the accounts that post many times before doing the network analysis (a sketch of the filtering step follows the chunk below).
# First look at the distribution of tweets per user: most accounts post only 1-2 tweets, so those with more than 2 are dropped
df_ads <- BD_df %>%
group_by(user_id) %>%
summarize(count = n()) %>%
group_by(count)%>%
filter(count<10)%>%
summarize(numbers = n())%>%
ggplot(aes(x=count, y=numbers)) +
geom_point() +
geom_line()
df_ads
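A minimal sketch of the filtering step itself, assuming (per the comment above) that accounts with more than 2 tweets in the window are treated as likely ads and dropped:

heavy_posters <- BD_df %>%
group_by(user_id) %>%
summarize(count = n()) %>%
filter(count > 2) %>%
pull(user_id) # user_ids with more than 2 tweets
BD_noads <- BD_df %>% filter(!user_id %in% heavy_posters) # keep tweets from the remaining accounts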