bitcoin

動機: 雖然有越來越多的人用機器學習和深度學習的方式來預測股市和比特幣的走勢。但我一直認為市場瞬息萬變,用過去的資料來預測未來漲跌實在不太可靠。那麼在這麼多變因中有沒有什麼是不變的,大概是是「人性」吧。情緒也許可以反應出來一二。 比特幣在5月23日的時候有一個明顯的下跌,所以想要分析在下跌的這一時間附近社群平台的討論都有什麼,同時情緒是怎樣的。

假設:

  • 是否有提前的負面情緒的釋出?
  • 情緒變化和價格變動是否有一定的相關性?

bitcoin 的歷史价格數據

bitcoin的歷史數據:

  • 網站讀取:his_data <- read.table()
  • 修改日期格式:as.POSIXct(his_data$Date)
  • 儲存為 bitcoin_his_data.csv
##   Unix.Timestamp                Date Symbol     Open     High      Low    Close
## 1   1.625098e+12 2021-07-01 00:01:00 BTCUSD 35032.65 35060.60 35032.65 35060.60
## 2   1.625098e+12 2021-07-01 00:00:00 BTCUSD 35021.00 35137.79 35021.00 35032.65
## 3   1.625098e+12 2021-06-30 23:59:00 BTCUSD 35079.71 35079.71 35021.00 35021.00
## 4   1.625097e+12 2021-06-30 23:58:00 BTCUSD 35070.96 35090.38 35069.07 35079.71
## 5   1.625097e+12 2021-06-30 23:57:00 BTCUSD 35054.06 35076.16 35047.26 35070.96
## 6   1.625097e+12 2021-06-30 23:56:00 BTCUSD 35059.21 35064.30 35042.63 35054.06
##       Volume
## 1 0.02096801
## 2 0.15234732
## 3 0.09060078
## 4 0.02939092
## 5 0.37807248
## 6 0.37451313
##   Unix.Timestamp                Date Symbol     Open     High      Low    Close
## 1   1.625098e+12 2021-07-01 00:01:00 BTCUSD 35032.65 35060.60 35032.65 35060.60
## 2   1.625098e+12 2021-07-01 00:00:00 BTCUSD 35021.00 35137.79 35021.00 35032.65
## 3   1.625098e+12 2021-06-30 23:59:00 BTCUSD 35079.71 35079.71 35021.00 35021.00
## 4   1.625097e+12 2021-06-30 23:58:00 BTCUSD 35070.96 35090.38 35069.07 35079.71
## 5   1.625097e+12 2021-06-30 23:57:00 BTCUSD 35054.06 35076.16 35047.26 35070.96
## 6   1.625097e+12 2021-06-30 23:56:00 BTCUSD 35059.21 35064.30 35042.63 35054.06
##       Volume
## 1 0.02096801
## 2 0.15234732
## 3 0.09060078
## 4 0.02939092
## 5 0.37807248
## 6 0.37451313
##                  Date     Open
## 1 2021-06-19 08:26:00 35745.18
## 2 2021-06-19 08:25:00 35699.35
## 3 2021-06-19 08:24:00 35721.90
## 4 2021-06-19 08:23:00 35733.24
## 5 2021-06-19 08:22:00 35712.97
## 6 2021-06-19 08:21:00 35693.15

bitcoin 的價格資料

  1. 以Public API下載匯率 Bitcoin屬於permissionless的blockchain,任何人都可以使用Public API下載匯率資料:
##    market base quote           timestamp market_timestamp    last     vwap
## 1: kraken  BTC   EUR 2021-07-01 22:10:10             <NA> 28329.1 28751.47
##      volume     ask     bid
## 1: 3123.995 28336.4 28329.8

bitcoin 的 tweet data

有關Twitter上bitcoin的討論資料是用tweet連結Twitter API,以ETH和bitcoin為關鍵詞抓取的,只能抓取近10天的資料。

情緒分析

有关 load data 的资料:https://www.r-bloggers.com/2019/05/how-to-save-and-load-datasets-in-r-an-overview/

## # A tibble: 6 x 90
##   user_id   status_id   created_at          screen_name  text            source 
##   <chr>     <chr>       <dttm>              <chr>        <chr>           <chr>  
## 1 88987226… 1398193870… 2021-05-28 08:26:10 JojorMaria   "@CryptoPoorBo… Twitte…
## 2 14207956… 1398193720… 2021-05-28 08:25:34 Bitcoin_Win  "Ethereum Pric… Revive…
## 3 14207956… 1397439017… 2021-05-26 06:26:39 Bitcoin_Win  "Goldman Sachs… Revive…
## 4 14207956… 1397514509… 2021-05-26 11:26:38 Bitcoin_Win  "Crypto Market… Revive…
## 5 85748594… 1398193356… 2021-05-28 08:24:08 borsa_sabir… "Ha şunu da be… Twitte…
## 6 13671238… 1398193207… 2021-05-28 08:23:32 OladipoAde1  "@AirdropStari… Twitte…
## # … with 84 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
## #   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
## #   is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
## #   quote_count <int>, reply_count <int>, hashtags <list>, symbols <list>,
## #   urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
## #   media_url <list>, media_t.co <list>, media_expanded_url <list>,
## #   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
## #   ext_media_expanded_url <list>, ext_media_type <chr>,
## #   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
## #   quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
## #   quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

添加這篇Twitter的url,方便之後在做主題模型分析的時候可以比較方便的回溯,(例如找到最能代表某個主題的貼文)

了解資料的資料筆數以及時間分布

created_at已經是一個date類型的欄位,因此可以直接用min,max來看最遠或最近的日期 註:rtweet最多只能抓到距今10天的資料

## [1] "2021-05-20 13:48:32 UTC"
## [1] "2021-05-28 08:26:10 UTC"

tweets內容清理

df共有90個欄位,但我們在這裡僅會使用幾個欄位:

user_id: 用戶id status_id : 推文id created_at : 發文時間 screen_name:網名 text : 推文內容 source : 發文來源 favorite_count:按讚數 retweet_count:回覆數量 Tweet_url:原始推文的url

## # A tibble: 6 x 9
##   user_id  status_id created_at          screen_name text  source favorite_count
##   <chr>    <chr>     <dttm>              <chr>       <chr> <chr>           <int>
## 1 8898722… 13981938… 2021-05-28 08:26:10 JojorMaria  Come… Twitt…              0
## 2 1420795… 13981937… 2021-05-28 08:25:34 Bitcoin_Win Ethe… Reviv…              0
## 3 1420795… 13974390… 2021-05-26 06:26:39 Bitcoin_Win Gold… Reviv…              1
## 4 1420795… 13975145… 2021-05-26 11:26:38 Bitcoin_Win Cryp… Reviv…              0
## 5 8574859… 13981933… 2021-05-28 08:24:08 borsa_sabi… Ha u… Twitt…              0
## 6 1367123… 13981932… 2021-05-28 08:23:32 OladipoAde1 Awes… Twitt…              1
## # … with 2 more variables: retweet_count <int>, Tweet_url <glue>

把日期、小時、分鐘分開

## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d'
## # A tibble: 6 x 6
##   status_id         created_at          source            Dates    Hours Minutes
##   <chr>             <dttm>              <chr>             <chr>    <chr> <chr>  
## 1 1398193870828048… 2021-05-28 08:26:10 Twitter for Andr… 2021-05… 08    26     
## 2 1398193720130891… 2021-05-28 08:25:34 Revive Social App 2021-05… 08    25     
## 3 1397439017344331… 2021-05-26 06:26:39 Revive Social App 2021-05… 06    26     
## 4 1397514509137301… 2021-05-26 11:26:38 Revive Social App 2021-05… 11    26     
## 5 1398193356619014… 2021-05-28 08:24:08 Twitter Web App   2021-05… 08    24     
## 6 1398193207457062… 2021-05-28 08:23:32 Twitter for Andr… 2021-05… 08    23

串連 CoreNLP

(1). API呼叫的設定 server端呼叫伺服器 :

  • 需先在terminal開啟corenlp server輸入:stanford-corenlp-4.2.0
  • 在corenlp的路徑下開啟terminal輸入 :java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

解析抽取資料的functions

  • coreNLP_tokens_parser:解析 coreNLP 回傳的資料(爬蟲解析的部分),抽取出斷詞的結果
    • do.call:do.call(what, args, quote = FALSE, envir = parent.frame()) 的功能就是執行一個函數,而這個函數的參數呢,是list的每個子元素。
    • do.call + rbind:Parse Parse list of dataframes with
  • coreNLP_dependency_parser:抽取出依存關係的結果
  • coreNLP_sentiment_parser:抽取出語句情緒的結果
# 從回傳的object中整理斷詞出結果,輸出為 tidydata 格式
# 其實還厚很多的欄位,之後需要的話可以加入到這個function中 coreNLP_tokens_parser
coreNLP_tokens_parser <- function(coreNLP_objects){
  
  result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
    original_data <- obj$data
    doc <- obj$doc
    
    # for a sentences
    sentences <- doc$sentences
    if(length(sentences)!=0) # 新加入的條件
    sen <- sentences[[1]]
    
    # 抽取出斷詞出結果:tokens
    tokens <- do.call(rbind, lapply(sen$tokens, function(x){
      result <- data.frame(word=x$word, lemma=x$lemma, pos=x$pos, ner=x$ner)
      result
    }))
    
    tokens <- original_data %>%
      t() %>% 
      data.frame() %>% 
      select(-text) %>% 
      slice(rep(1:n(), each = nrow(tokens))) %>% 
      bind_cols(tokens)
    
    tokens
  }))
  return(result)
}

# 從回傳的core-nlp object中整理出詞彙依存關係,輸出為 tidydata 格式
coreNLP_dependency_parser <- function(coreNLP_objects){
  result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
    original_data <- obj$data
    doc <- obj$doc
    
    # for a sentences
    sentences <- doc$sentences
    if(length(sentences)!=0) # 新加入的條件
    sen <- sentences[[1]]
    
    # 抽取出依存關係:Dependencies
    dependencies <- do.call(rbind, lapply(sen$basicDependencies, function(x){
      result <- data.frame(dep=x$dep, governor=x$governor, governorGloss=x$governorGloss, dependent=x$dependent, dependentGloss=x$dependentGloss)
      result
    }))
    
    dependencies <- original_data %>%
      t() %>% 
      data.frame() %>% 
      select(-text) %>% 
      slice(rep(1:n(), each = nrow(dependencies))) %>% 
      bind_cols(dependencies)
    dependencies
  }))
  return(result)
}

# 從回傳的core-nlp object中整理出語句情緒,輸出為 tidydata 格式
coreNLP_sentiment_parser <- function(coreNLP_objects){
  result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
    original_data <- obj$data
    doc <- obj$doc
    # for a sentences
    sentences <- doc$sentences
    if(length(sentences)!=0) # 新加入的條件
    sen <- sentences[[1]]
    
    # 抽取出情緒:sentiment
    sentiment <- original_data %>%
      t() %>% 
      data.frame() %>% 
      bind_cols(data.frame(sentiment=sen$sentiment, sentimentValue=sen$sentimentValue))
  
    sentiment
  }))
  return(result)
}

資料可視化的function

  • parse2tree:用樹狀圖的方式可視化依存關係
## # A tibble: 6 x 2
##   status_id         text                                                        
##   <chr>             <chr>                                                       
## 1 1398193870828048… Come here beb                                               
## 2 1398193720130891… Ethereum Price Prediction ETH bids goodbye to 3000 as bulli…
## 3 1397439017344331… Goldman Sachs Ethereum ETH Might Overtake Bitcoin BTC As A …
## 4 1397514509137301… Crypto Market Tanks 14 to 3Month Low Under 1.35 Trillion Et…
## 5 1398193356619014… Ha unu da belirteyimveyaay oldu mu diyenler var. Ay piyasas…
## 6 1398193207457062… Awesome project

提取結果

(1). 斷詞、詞彙還原、詞性標註、NER

载入之前已经存好的coreNLP结果

袪除掉裡面的一些空值(因為是空值的化,沒辦法用我們上面設定好的函數來解析)

先test一下

## [1] "number is 1"
## [1] "number is 2"
## [1] "number is 3"
## [1] "number is 4"
## [1] "number is 5"
## [1] "number is 6"
## [1] "number is 7"
## [1] "number is 8"
## [1] "number is 9"
## [1] "number is 10"
## [1] "number is 12"
## [1] "number is 13"
## [1] "number is 14"
## [1] "number is 15"
## [1] "number is 16"
## [1] "number is 17"
## [1] "number is 18"
## [1] "number is 19"
## [1] "number is 20"
## [1] "number is 21"

把test中的sentens是空值的rows刪掉了 长度从21变为了20,我们成功将空值删除了

## [1] 21
## [1] 20
##             status_id  word lemma pos ner
## 1 1397129949706915850    So    so  RB   O
## 2 1397129949706915850   who   who  WP   O
## 3 1397129949706915850    is    be VBZ   O
## 4 1397129949706915850 ready ready  JJ   O
## 5 1397129949706915850    to    to  TO   O
## 6 1397129949706915850  make  make  VB   O
## [1] 3 7
## [[1]]
## [1] "z"
## 
## [[2]]
## [1] "dd"
## 
## [[3]]
## [1] "ff"
## 
## [[4]]
## [1] "s"
## 
## [[5]]
## [1] "5"
## 
## [[6]]
## [1] "u"

對所有的data做上面👆的事情

## [1] 30770
## [1] 30697

coreNLP_tokens_parser欄位:

  • status_id : 對應原本df裡的status_id,為一則tweets的唯一id
  • word: 原始斷詞
  • lemma : 對斷詞做詞形還原
  • pos : part-of-speech,詞性
  • ner: 命名實體

直接讀取之前解析好的tokens

##       status_id       word      lemma pos    ner
## 1  1.398194e+18       Come       come  VB      O
## 2  1.398194e+18       here       here  RB      O
## 3  1.398194e+18        beb        beb NNP      O
## 4  1.398194e+18   Ethereum   Ethereum NNP      O
## 5  1.398194e+18      Price      Price NNP      O
## 6  1.398194e+18 Prediction Prediction NNP      O
## 7  1.398194e+18        ETH        ETH NNP      O
## 8  1.398194e+18       bids        bid NNS      O
## 9  1.398194e+18    goodbye    goodbye  NN      O
## 10 1.398194e+18         to         to  TO      O
## 11 1.398194e+18       3000       3000  CD NUMBER
## 12 1.398194e+18         as         as  IN      O
## 13 1.398194e+18    bullish    bullish  JJ      O
## 14 1.398194e+18   momentum   momentum  NN      O
## 15 1.398194e+18      fades       fade VBZ      O
## 16 1.398194e+18   Ethereum   Ethereum NNP      O
## 17 1.398194e+18 gracefully gracefully  RB      O
## 18 1.398194e+18       rose       rise VBD      O
## 19 1.398194e+18       from       from  IN      O
## 20 1.398194e+18        the        the  DT      O

NER分析

實體分析,我們可以知道參與討論的都有哪些國家、組織等

##  [1] O                 NUMBER            DATE              ORGANIZATION     
##  [5] SET               MISC              NATIONALITY       CAUSE_OF_DEATH   
##  [9] STATE_OR_PROVINCE TITLE             LOCATION          PERSON           
## [13] CITY              DURATION          MONEY             COUNTRY          
## [17] ORDINAL           CRIMINAL_CHARGE   IDEOLOGY          RELIGION         
## [21] TIME              PERCENT           URL              
## 23 Levels: CAUSE_OF_DEATH CITY COUNTRY CRIMINAL_CHARGE DATE ... URL
## [1] 16140

轉小寫

因為大小寫也會影響corenlp對NER的判斷,因此我們一開始給的推文內容是沒有處理大小寫的,但在跑完anotator後,為了正確計算詞頻,創建新欄位lower_word與lower_lemma,存放轉換小寫的word與lemma。轉成小寫的目的是要將不同大小寫的同一字詞(如Evergiven與evergiven)都換成小寫,再來計算詞頻

##  [1] O                 NUMBER            DATE              ORGANIZATION     
##  [5] SET               MISC              NATIONALITY       CAUSE_OF_DEATH   
##  [9] STATE_OR_PROVINCE TITLE             LOCATION          PERSON           
## [13] CITY              DURATION          MONEY             COUNTRY          
## [17] ORDINAL           CRIMINAL_CHARGE   IDEOLOGY          RELIGION         
## [21] TIME              PERCENT           URL              
## 23 Levels: CAUSE_OF_DEATH CITY COUNTRY CRIMINAL_CHARGE DATE ... URL

國家的分析

所以討論最多的還是美國和中國

涉及到的組織(ORGANIZATION)

Cardano(ADA)是一個公共區塊鏈平台。 比特幣(BTC):對抗通脹、價值存儲、數位黃金 以太幣(ETH):加密貨幣世界的基礎設施、水電煤 幣安幣(BNB):為代表的平台幣:代表交易所股票 USDT 為代表的穩定幣:支付

涉及到的人物(PERSON)

查爾斯·霍斯金森(Charles Hoskinson)是Bitshares的聯合創始人 卡爾達諾(Cardano)和以太坊(Ethereum)的聯合創始人,這兩個平台都是區塊鏈平台。 以太坊第二層擴容方案 Polygon(MATIC)

情緒分數

了解文章的情緒

##              status_id
## 1  1398193870828048387
## 2  1398193720130891776
## 3  1397439017344331784
## 4  1397514509137301508
## 5  1398193356619014145
## 6  1398193207457062913
## 7  1396902299117752325
## 8  1397935607356272646
## 9  1398193017706717185
## 10 1397978301428289547
## 11 1397828282368811008
## 12 1398183614198714368
## 13 1398187296713842691
## 14 1397943404424794117
## 15 1397833729914163200
## 16 1397949637932236800
## 17 1397869800244850690
## 18 1397754842190270464
## 19 1397671741309128706
## 20 1396988433831055362
##                                                                                                                                                                                                                                          text
## 1                                                                                                                                                                                                                               Come here beb
## 2                                                Ethereum Price Prediction ETH bids goodbye to 3000 as bullish momentum fades Ethereum gracefully rose from the support formed at 1750 to trade near 2800. The recovery was not unique to the
## 3                                                    Goldman Sachs Ethereum ETH Might Overtake Bitcoin BTC As A Store of Value Amid the recent crypto market volatility and the rising dominance of Ethereum ETH the debate of ETH vs BTC has
## 4                                                   Crypto Market Tanks 14 to 3Month Low Under 1.35 Trillion Ethereum ETH Under 2000 On Sunday May 23 the cryptocurrency market has extended its weekly losses further tanking another 14 and
## 5  Ha unu da belirteyimveyaay oldu mu diyenler var. Ay piyasasn aha ayya girdik demekle olmaz ki. Grafikte ay piyasasna girdin diyen indikatr veya teknik yok. nemli olan u birok alt coin. Ocak ubat fiyatna geldi. Bitcoin de ocak fiyatnda
## 6                                                                                                                                                                                                                             Awesome project
## 7                                                                                                                                                                     EK olarak EKL ile belirlenecek 5 kiiye  2er adet CAKE  hediye edeceiz..
## 8                       trx Bitcoin ve ETH baklrsa fiyat yl geneline gre halen daha uygun saylr olas bir yn hareketi belirterek destek ve direnleri izdim. Terste kalanlarn sabrla beklemesi gerekir en azndan salkl bir projede yatrmnz var.
## 9                                          xrp dn direnleri bugn destekleri konualm en yakn destek 84 seviyesi 2. destek 75 seviyesi.. Bugndzeltmesinin uzun soluklu olacan dnmyorum xrp ve eth nin 23 haftas daha var diye tahmin ediyorum..
## 10                        Bakn bence Elon Musk 1 ise Bu adam kesinlikle maniplasyon konusunda 10'dur. Bana att tweetler hi bir ekilde samimi gelmiyor.. Bu adam HODL diyorsa ben endie duyuyorum gerekten.. En byk maniplasyonu bunlar yapyor
## 11                                                 Piyasann yeilleri yakmas an meselesincesi minik bir dzeltme ardndan hafta sonu art depremleri ve nmzde gl bir ykseli dalgas nceki krlm noktalarndan coinler toplanabilir.. Analizler iin..
## 12                                                           Hafta sonuna girerken olabilecek dzeltmeler den korkmamal ama temkinli olmak lazm stop kullanmay ihmal etmeyin. Ayrca aa dip grdnz destek noktalarna alm oltalar girebilirsiniz.
## 13                                                                                   Bugn tekrar herkes elinde terste kalan coinleri yazsn tek tek bakalm gn ierisinde grafikler ne durumdaizin verdii srece grafikler zerinde yorum yaparz..
## 14                                                   Hangi coin veya coinlerden terste kaldnz? Bu yl en ok hangi coinden kazandnz? Kar damsnz? Zarar damsnz? Yakn bir sre ierisinde geri dnm iin gzel bir alma balatacaz lk ders kaybetmemek.
## 15                            crv Dn sorulan coinlerden biri alan kiiler maliyetlerini belirtebilirler yorumu ona gre yapabilirim. ATH seviyesinden ok geride TOBO oluumu bir ok coinde gzlemlenebilir nmzdeki hafta. Daha fazla Analiz iin..
## 16                                        pnt ilk ekleme noktas 0.87 seviyesi kademeli bir ekilde alma uygun cake eer ykseliini srdrmeye devam ederse pnt fiyat ok ucuz kalacak.. kademeli hedefler belirtilmitir ''Pump coini'' olduu aikar.
## 17                                                                 Hangi hatay yaptn tam olarak zemiyorsan ayn ekilde piyasa ile alakal i skntn telegram grubumuzda da dile getirebilirsiniz. Her eyden nce Psikolojinizi korumak zorundasnz.
## 18                                                                                                                                                                                        Buy BTC With Your Credit Card  Fast Crypto Exchange
## 19                                                                                                                                                                                        Buy BTC With Your Credit Card  Fast Crypto Exchange
## 20                                                                                                                                                                                     Buy Crypto With Your Credit Card  Fast Crypto Exchange
##       sentiment sentimentValue
## 1       Neutral              2
## 2      Positive              3
## 3      Positive              3
## 4      Negative              1
## 5      Positive              3
## 6  Verypositive              4
## 7       Neutral              2
## 8       Neutral              2
## 9      Positive              3
## 10      Neutral              2
## 11      Neutral              2
## 12      Neutral              2
## 13      Neutral              2
## 14      Neutral              2
## 15      Neutral              2
## 16      Neutral              2
## 17      Neutral              2
## 18      Neutral              2
## 19      Neutral              2
## 20      Neutral              2
## [1] Neutral      Positive     Negative     Verypositive Verynegative
## Levels: Neutral Positive Negative Verypositive Verynegative
## .
##      Neutral     Positive     Negative Verypositive Verynegative 
##        20875         5497         4119          195           11

##             status_id
## 1 1395375895532146699
## 2 1395376077246238724
## 3 1395376411603443720
## 4 1395376681573965825
## 5 1395376684514357251
## 6 1395376687014109185
##                                                                                                                                                                 text
## 1 If Shibu ever challenges Doge in the future it may be the battle of the century between cryptos.Crypto semantics make this a plausible reality in the near future.
## 2                                                                                                                                      hbar btc eth xrp vet ada doge
## 3                                                                                                                            What Is Internet Computer Protocol ICP?
## 4                                                                                                                Buy BTC With Your Credit Card  Fast Crypto Exchange
## 5                                                                                                        These ExJournalists Are Using AI to Catch Online Defamation
## 6                                                                                                          The Long Journey of Usoni an African Postapocalyptic Game
##   sentiment sentimentValue created_at             source      Dates Hours
## 1  Negative              3 2021-05-20 Twitter for iPhone 2021-05-20    13
## 2   Neutral              1 2021-05-20    Twitter Web App 2021-05-20    13
## 3   Neutral              1 2021-05-20    Twitter Web App 2021-05-20    13
## 4   Neutral              1 2021-05-20         CoinTweety 2021-05-20    13
## 5   Neutral              1 2021-05-20              IFTTT 2021-05-20    13
## 6   Neutral              1 2021-05-20              IFTTT 2021-05-20    13
##   Minutes
## 1      48
## 2      49
## 3      50
## 4      51
## 5      51
## 6      51

先看一下隨著time的變化情緒的變化(優點太密集了)

因為分鐘實在是切的有一點太細了,所以我們現在改成看時間的

## `summarise()` has grouped output by 'Dates'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups:   Dates [1]
##   Dates      Hours avg_sentiment
##   <chr>      <chr>         <dbl>
## 1 2021-05-20 13             1.51
## 2 2021-05-20 14             1.54
## 3 2021-05-20 15             1.55
## 4 2021-05-20 16             1.53
## 5 2021-05-20 17             1.68
## 6 2021-05-20 18             1.53
## # A tibble: 6 x 5
## # Groups:   Dates [1]
##   Dates      Hours avg_sentiment time          times              
##   <chr>      <chr>         <dbl> <chr>         <dttm>             
## 1 2021-05-20 13             1.51 2021-05-20 13 2021-05-20 13:00:00
## 2 2021-05-20 14             1.54 2021-05-20 14 2021-05-20 14:00:00
## 3 2021-05-20 15             1.55 2021-05-20 15 2021-05-20 15:00:00
## 4 2021-05-20 16             1.53 2021-05-20 16 2021-05-20 16:00:00
## 5 2021-05-20 17             1.68 2021-05-20 17 2021-05-20 17:00:00
## 6 2021-05-20 18             1.53 2021-05-20 18 2021-05-20 18:00:00

情緒分數從最低分0~最高分4 + 0,1 : very negative,negative + 2 : neutral + 3,4 : very positive,postive

##                  Date     Open      Dates Hours avg_sentiment          time
## 1 2021-05-20 14:00:00 41835.51 2021-05-20    14      1.538462 2021-05-20 14
## 2 2021-05-20 15:00:00 41547.04 2021-05-20    15      1.551724 2021-05-20 15
## 3 2021-05-20 16:00:00 41526.04 2021-05-20    16      1.534188 2021-05-20 16
## 4 2021-05-20 17:00:00 39383.56 2021-05-20    17      1.677273 2021-05-20 17
## 5 2021-05-20 18:00:00 39893.33 2021-05-20    18      1.527950 2021-05-20 18
## 6 2021-05-20 19:00:00 39920.72 2021-05-20    19      1.439759 2021-05-20 19
##   scaleOpen    scaleAS
## 1 1.9934591  0.6232510
## 2 1.8431808  0.7405717
## 3 1.8322409  0.5854476
## 4 0.7161177  1.8511723
## 5 0.9816819  0.5302689
## 6 0.9959507 -0.2498696

討論主題分析

LDA 主題分類

  • 根據詞頻,選擇只出現3字以上的字
  • 整理成url,word,n的格式之後,就可以轉dtm

P.S. groupby by之後原本的字詞結構會不見,把詞頻另存在一個reserved_word裡面

##      status_id       word      lemma pos ner lower_word lower_lemma
## 1 1.398194e+18       Come       come  VB   O       come        come
## 2 1.398194e+18       here       here  RB   O       here        here
## 3 1.398194e+18        beb        beb NNP   O        beb         beb
## 4 1.398194e+18   Ethereum   Ethereum NNP   O   ethereum    ethereum
## 5 1.398194e+18      Price      Price NNP   O      price       price
## 6 1.398194e+18 Prediction Prediction NNP   O prediction  prediction
## # A tibble: 6 x 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART
##      status_id     lower_word count
## 1 1.395376e+18         battle     1
## 2 1.395376e+18        century     1
## 3 1.395376e+18     challenges     1
## 4 1.395376e+18 cryptos.crypto     1
## 5 1.395376e+18           doge     1
## 6 1.395376e+18         future     1

將資料轉換為Document Term Matrix (DTM)

## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 17/83
## Sparsity           : 83%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##                      Terms
## Docs                  ada battle btc doge eth future future. hbar reality vet
##   1395375895532146688   0      1   0    1   0      1       1    0       1   0
##   1395376077246238720   1      0   1    1   1      0       0    1       0   1
##   1395376411603443712   0      0   0    0   0      0       0    0       0   0
##   1395376681573965824   0      0   1    0   0      0       0    0       0   0
##   1395376684514357248   0      0   0    0   0      0       0    0       0   0
##   1395376687014109184   0      0   0    0   0      0       0    0       0   0
##   1395376690839408640   0      0   0    0   0      0       0    0       0   0
##   1395376809823412224   0      0   1    0   0      0       0    0       0   0
##   1395376818476171264   0      0   0    0   0      0       0    0       0   0
##   1395376894783205376   1      0   1    1   1      0       0    0       0   0

建立LDA模型

透過perplexity找到最佳主題數

## [1] "2 topic(s) and use time is  0.930460929870605"
## [1] "4 topic(s) and use time is  24.2873389720917"
## [1] "6 topic(s) and use time is  45.7744100093842"
## [1] "8 topic(s) and use time is  56.1082332134247"
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.

上圖選擇 4 個 topic

取出代表字詞(term)

topic 1 = “ ” topic 2 = “ ” topic 3 = “ ” topic 4 = “ ”

資料內容探索

## # A tibble: 29,745 x 3
## # Groups:   document [29,689]
##    document            topic gamma
##    <chr>               <int> <dbl>
##  1 1395376809823412224     1 0.875
##  2 1395377181518401536     1 0.820
##  3 1395377590454702080     1 0.587
##  4 1395377780527796224     1 0.985
##  5 1395378767145676800     1 0.977
##  6 1395378816768434176     1 0.972
##  7 1395379083815489536     1 0.985
##  8 1395381293991464960     1 0.844
##  9 1395382161499475968     1 0.898
## 10 1395382987567554560     1 0.989
## # … with 29,735 more rows
##                                                                                                                                                                                                               text
## 1                                                                                                                                  We fly to mars!!!eth With infinite burninggains value every dayeth in actioneth
## 2                                                                                                                              Come learnonwith aor! How to Buy Cryptocurrency on Bitrue With Debit or Credit Card
## 3                                                    Speklatif hareketlerparalarn gelecein teknolojisi olmasn vein gvenilirliini asla deitirmeyecektir.gelecein teknolojisinde ba ekmektedir.nun da dedii gibi ''.
## 4                                                                                                                                                      Old Masters Art Auguste Renoir Georges Rivire 1877. Buy at 
## 5 Grnm alalan kama 4h periyotta RSI da pozitif uyumsuzluk var ve 384 e kadar ykseli yapabilir.nemli olan Kama nn dna trendi krp kabilmesidir. 384 de kp kramazsa kama desteine yeniden iner destekdeki izlenir bnb

各個日期的主題分佈

## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.

## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.

把廣告的帖子刪掉再來分析一遍

一大部分的 Twitter 是廣告,我們要將其篩選出來。 一般正常用戶不會重複留言,所以我們把多次留言的去掉在進行網絡分析。