bitcoin
Motivation: More and more people are using machine learning and deep learning to predict the stock market and Bitcoin prices, but I have always felt that markets change too quickly for past data to be a reliable predictor of future rises and falls. Among all these variables, is there anything that stays constant? Probably "human nature," and sentiment may reflect part of it. Bitcoin dropped sharply on May 23, so I want to analyze what the social-media discussion around that time looked like and what the sentiment was.
Hypotheses:
- Was negative sentiment released before the drop?
- Is there a correlation between changes in sentiment and changes in price?
library(wordcloud2)
library(ggplot2)
library(scales)
library(rtweet)
library(dplyr)
library(xml2)
library(httr)
library(jsonlite)
library(magrittr)
library(data.tree)
library(tidytext)
library(stringr)
library(DiagrammeR)
Historical bitcoin price data

Bitcoin historical data:
- Read from the website: his_data <- read.table()
- Convert the date format: as.POSIXct(his_data$Date)
- Save as bitcoin_his_data.csv (the last two steps are sketched after the chunk below)
# Read the historical data from the web
his_data <- read.table("https://www.cryptodatadownload.com/cdd/gemini_BTCUSD_2021_1min.csv",header = T,sep = ",",skip=1 )
head(his_data)

## Unix.Timestamp Date Symbol Open High Low Close
## 1 1.625098e+12 2021-07-01 00:01:00 BTCUSD 35032.65 35060.60 35032.65 35060.60
## 2 1.625098e+12 2021-07-01 00:00:00 BTCUSD 35021.00 35137.79 35021.00 35032.65
## 3 1.625098e+12 2021-06-30 23:59:00 BTCUSD 35079.71 35079.71 35021.00 35021.00
## 4 1.625097e+12 2021-06-30 23:58:00 BTCUSD 35070.96 35090.38 35069.07 35079.71
## 5 1.625097e+12 2021-06-30 23:57:00 BTCUSD 35054.06 35076.16 35047.26 35070.96
## 6 1.625097e+12 2021-06-30 23:56:00 BTCUSD 35059.21 35064.30 35042.63 35054.06
## Volume
## 1 0.02096801
## 2 0.15234732
## 3 0.09060078
## 4 0.02939092
## 5 0.37807248
## 6 0.37451313
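The list above also mentions converting the Date column and saving a local copy; that code is not shown, so here is a minimal sketch using the column names from the preview above:

his_data$Date <- as.POSIXct(his_data$Date) # convert the date format (assumes "YYYY-MM-DD HH:MM:SS" strings)
write.csv(his_data, "bitcoin_his_data.csv", row.names = FALSE) # save a local copy as bitcoin_his_data.csv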
# "2021-05-20 13:48:32 UTC"
# "2021-05-28 08:26:10 UTC"
open_data <- his_data %>%
filter(Date >= "2021-05-20 13:48:32" & Date <= "2021-06-19 08:26:10")%>%
select(Date,Open)
head(open_data)

## Date Open
## 1 2021-06-19 08:26:00 35745.18
## 2 2021-06-19 08:25:00 35699.35
## 3 2021-06-19 08:24:00 35721.90
## 4 2021-06-19 08:23:00 35733.24
## 5 2021-06-19 08:22:00 35712.97
## 6 2021-06-19 08:21:00 35693.15
open_data %>%
ggplot(aes(Date, Open)) +
theme_minimal() +
theme(plot.title = element_text(face = "bold")) +
# geom_vline(xintercept = as.numeric(as.Date("2021-05-28")), col='red', size = 1) +
geom_line(col = '#ffa500') +
labs(title = 'Bitcoin', x = '')

Bitcoin price data
# cited from: https://www.r-bloggers.com/querying-the-bitcoin-blockchain-with-r/
# install.packages("Rbitcoin")
library(Rbitcoin)

- Download exchange rates via the public API. Bitcoin is a permissionless blockchain, so anyone can download exchange-rate data through the public API:
wait <- antiddos(market = 'kraken', antispam_interval = 5, verbose = 1)
market.api.process('kraken',c('BTC','EUR'),'ticker')

## market base quote timestamp market_timestamp last vwap
## 1: kraken BTC EUR 2021-07-01 22:10:10 <NA> 28329.1 28751.47
## volume ask bid
## 1: 3123.995 28336.4 28329.8
Bitcoin tweet data

The Twitter discussion data on bitcoin was collected through the Twitter API via rtweet, using "ETH" and "bitcoin" as keywords; only roughly the last 10 days of tweets can be retrieved this way.
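The collection code itself is not included here; a minimal sketch of how the data was presumably gathered with rtweet (the query string, n, and options are assumptions; a Twitter API token is required):

# tweet_df <- search_tweets(q = "bitcoin OR ETH", n = 18000, include_rts = FALSE, retryonratelimit = TRUE)
# save(tweet_df, file = "Bitcoindata-2.Rdata")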
Sentiment analysis

On how to save and load data sets in R: https://www.r-bloggers.com/2019/05/how-to-save-and-load-datasets-in-r-an-overview/
# load data
# save(tweet_df, file = "Bitcoindata-2.Rdata")
bitcoin_data <- get(load("Bitcoindata-2.Rdata"))
head(bitcoin_data)

## # A tibble: 6 x 90
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 88987226… 1398193870… 2021-05-28 08:26:10 JojorMaria "@CryptoPoorBo… Twitte…
## 2 14207956… 1398193720… 2021-05-28 08:25:34 Bitcoin_Win "Ethereum Pric… Revive…
## 3 14207956… 1397439017… 2021-05-26 06:26:39 Bitcoin_Win "Goldman Sachs… Revive…
## 4 14207956… 1397514509… 2021-05-26 11:26:38 Bitcoin_Win "Crypto Market… Revive…
## 5 85748594… 1398193356… 2021-05-28 08:24:08 borsa_sabir… "Ha şunu da be… Twitte…
## 6 13671238… 1398193207… 2021-05-28 08:23:32 OladipoAde1 "@AirdropStari… Twitte…
## # … with 84 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
## # reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
## # is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
## # quote_count <int>, reply_count <int>, hashtags <list>, symbols <list>,
## # urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
## # media_url <list>, media_t.co <list>, media_expanded_url <list>,
## # media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
## # ext_media_expanded_url <list>, ext_media_type <chr>,
## # mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
## # quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
## # quoted_source <chr>, quoted_favorite_count <int>,
## # quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## # quoted_name <chr>, quoted_followers_count <int>,
## # quoted_friends_count <int>, quoted_statuses_count <int>,
## # quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## # retweet_source <chr>, retweet_favorite_count <int>,
## # retweet_retweet_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## # description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## # friends_count <int>, listed_count <int>, statuses_count <int>,
## # favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## # profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>
Add each tweet's URL so it is easy to trace back to the original post later, for example to find the tweet that best represents a topic in the topic-model analysis.
BD_select <- bitcoin_data %>%
select(user_id, status_id, created_at, screen_name, text, source, favorite_count, retweet_count) %>%
mutate(Tweet_url = glue::glue("https://twitter.com/{screen_name}/status/{status_id}"))
# select(DateTime = created_at, User = screen_name, text, Tweet_url, Likes = favorite_count, RTs = retweet_count, URLs = urls_expanded_url)

Number of records and their time distribution
created_at is already a datetime column, so min and max can be used directly to find the earliest and latest timestamps. Note: rtweet can only retrieve tweets from roughly the last 10 days.
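The two UTC timestamps printed after the chunk below are presumably obtained with:

min(BD_select$created_at)
max(BD_select$created_at)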
# Count tweets per hour ("1 hours"); this could also be per minute ("1 mins") or a custom number of hours
ts_plot(BD_select, "1 hours") +
theme_minimal() +
theme(plot.title = element_text(face = "bold")) +
labs(x = NULL, y = NULL,
title = "Frequency of #bitcoin AND ETH Twitter statuses from past 8 days",
subtitle = "Twitter status (tweet) counts aggregated using an hour intervals",
caption = "\nSource: Data collected from Twitter's REST API via rtweet")## [1] "2021-05-20 13:48:32 UTC"
## [1] "2021-05-28 08:26:10 UTC"
Cleaning the tweet text
## Text-cleaning function
clean = function(txt) {
txt = iconv(txt, "latin1", "ASCII", sub="") # normalize the encoding
txt = gsub("(@|#)\\w+", "", txt) # remove @mentions and #hashtags
txt = gsub("(http|https)://.*", "", txt) # remove URLs (. is any character, * means zero or more)
txt = gsub("[ \t]{2,}", "", txt) # remove runs of two or more spaces/tabs
txt = gsub("\\n"," ",txt) # remove line breaks
txt = gsub("\\s+"," ",txt) # collapse one or more whitespace characters into a single space
txt = gsub("^\\s+|\\s+$","",txt) # trim leading and trailing whitespace
txt = gsub("&.*;","",txt) # remove HTML entities
txt = gsub("[^a-zA-Z0-9?!. ']","",txt) # keep only letters, digits, spaces and ? ! . ' (drops emoji)
txt }
BD_select$text = clean(BD_select$text) # apply the cleaning to the text column
df = data.frame()
BD_df = rbind(df,BD_select) # transfer to data frame
BD_df = BD_df[!duplicated(BD_df[,"status_id"]),] # drop duplicated tweets

The data frame has 90 columns, but only a few of them are used here:
- user_id: user ID
- status_id: tweet ID
- created_at: time the tweet was posted
- screen_name: screen name
- text: tweet text
- source: client the tweet was posted from
- favorite_count: number of likes
- retweet_count: number of retweets
- Tweet_url: URL of the original tweet
## # A tibble: 6 x 9
## user_id status_id created_at screen_name text source favorite_count
## <chr> <chr> <dttm> <chr> <chr> <chr> <int>
## 1 8898722… 13981938… 2021-05-28 08:26:10 JojorMaria Come… Twitt… 0
## 2 1420795… 13981937… 2021-05-28 08:25:34 Bitcoin_Win Ethe… Reviv… 0
## 3 1420795… 13974390… 2021-05-26 06:26:39 Bitcoin_Win Gold… Reviv… 1
## 4 1420795… 13975145… 2021-05-26 11:26:38 Bitcoin_Win Cryp… Reviv… 0
## 5 8574859… 13981933… 2021-05-28 08:24:08 borsa_sabi… Ha u… Twitt… 0
## 6 1367123… 13981932… 2021-05-28 08:23:32 OladipoAde1 Awes… Twitt… 1
## # … with 2 more variables: retweet_count <int>, Tweet_url <glue>
Splitting out the date, hour, and minute
BD_date <- BD_df%>%
select(status_id,created_at,source)
BD_date$Dates <- format(as.Date(BD_date$created_at,"%Y-%m-%d"), format = "%Y-%m-%d") # for POSIXct input the second argument of as.Date() is a timezone, not a format, hence the warning below

## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d'
BD_date$Hours <- format(as.POSIXct(BD_date$created_at, "%Y-%m-%d %H:%M:%S", tz = ""), format = "%H")
BD_date$Minutes <- format(as.POSIXct(BD_date$created_at, "%Y-%m-%d %H:%M:%S", tz = ""), format = "%M")
head(BD_date)

## # A tibble: 6 x 6
## status_id created_at source Dates Hours Minutes
## <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 1398193870828048… 2021-05-28 08:26:10 Twitter for Andr… 2021-05… 08 26
## 2 1398193720130891… 2021-05-28 08:25:34 Revive Social App 2021-05… 08 25
## 3 1397439017344331… 2021-05-26 06:26:39 Revive Social App 2021-05… 06 26
## 4 1397514509137301… 2021-05-26 11:26:38 Revive Social App 2021-05… 11 26
## 5 1398193356619014… 2021-05-28 08:24:08 Twitter Web App 2021-05… 08 24
## 6 1398193207457062… 2021-05-28 08:23:32 Twitter for Andr… 2021-05… 08 23
Connecting to CoreNLP

(1). API call setup: starting the server
- First start the CoreNLP server (stanford-corenlp-4.2.0) from a terminal.
- Open a terminal in the CoreNLP directory and run: java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
# Build the coreNLP API URL: turn the local address into a URL the coreNLP service understands
generate_API_url <- function(host, port="9000",
tokenize.whitespace="false", annotators=""){ # do not split tokens on whitespace only
url <- sprintf('http://%s:%s/?properties={"tokenize.whitespace":"%s","annotators":"%s"}', host, port, tokenize.whitespace, annotators)
url <- URLencode(url) # percent-encode the URL (e.g. "%e8%98%8b%e6")
}
# specify where the service is running
host = "127.0.0.1"
generate_API_url(host)

# Call the coreNLP API
# Annotators used: annotators=c("tokenize","ssplit","pos","lemma","ner","parse","sentiment")
call_coreNLP <- function(server_host, text, host="localhost", language="eng",
tokenize.whitespace="true", ssplit.eolonly="true", annotators=c("tokenize","ssplit","pos","lemma","ner","parse","sentiment")){
# Assume two core-nlp servers: one for English (port 9000) and one for Chinese (port 9001)
port <- ifelse(language=="eng", 9000, 9001);
# build the API URL
url <- generate_API_url(server_host, port=port,
tokenize.whitespace=tokenize.whitespace, annotators=paste0(annotators, collapse = ','))
result <- POST(url, body = text, encode = "json")
doc <- httr::content(result, "parsed","application/json",encoding = "UTF-8")
return (doc)
}
# Run each document through the coreNLP service
coreNLP <- function(data,host){
# send each document to core-nlp in turn; each document comes back as JSON
# the results are stored as R objects
result <- apply(data, 1 , function(x){
object <- call_coreNLP(host, x['text'])
list(doc=object, data=x)
})
return(result)
}

Functions for parsing the extracted data
- coreNLP_tokens_parser: parses the data returned by coreNLP and extracts the tokenization results
- do.call: do.call(what, args, quote = FALSE, envir = parent.frame()) calls a function with the elements of a list as its arguments
- do.call + rbind: binds a list of data frames into a single data frame (a tiny illustration follows this list)
- coreNLP_dependency_parser: extracts the dependency-parse results
- coreNLP_sentiment_parser: extracts the sentence-sentiment results
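A tiny illustration of the do.call + rbind pattern on made-up data:

dfs <- list(data.frame(x = 1), data.frame(x = 2), data.frame(x = 3))
do.call(rbind, dfs) # equivalent to rbind(dfs[[1]], dfs[[2]], dfs[[3]]): one 3-row data frame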
# Tidy the tokenization results out of the returned object, output in tidy-data format
# There are many more fields available; they can be added to this function later if needed: coreNLP_tokens_parser
coreNLP_tokens_parser <- function(coreNLP_objects){
result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
original_data <- obj$data
doc <- obj$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)!=0) # newly added guard against empty results
sen <- sentences[[1]]
# extract the tokenization results: tokens
tokens <- do.call(rbind, lapply(sen$tokens, function(x){
result <- data.frame(word=x$word, lemma=x$lemma, pos=x$pos, ner=x$ner)
result
}))
tokens <- original_data %>%
t() %>%
data.frame() %>%
select(-text) %>%
slice(rep(1:n(), each = nrow(tokens))) %>%
bind_cols(tokens)
tokens
}))
return(result)
}
# Tidy the word dependency relations out of the returned core-nlp object, output in tidy-data format
coreNLP_dependency_parser <- function(coreNLP_objects){
result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
original_data <- obj$data
doc <- obj$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)!=0) # newly added guard against empty results
sen <- sentences[[1]]
# extract the dependency relations: Dependencies
dependencies <- do.call(rbind, lapply(sen$basicDependencies, function(x){
result <- data.frame(dep=x$dep, governor=x$governor, governorGloss=x$governorGloss, dependent=x$dependent, dependentGloss=x$dependentGloss)
result
}))
dependencies <- original_data %>%
t() %>%
data.frame() %>%
select(-text) %>%
slice(rep(1:n(), each = nrow(dependencies))) %>%
bind_cols(dependencies)
dependencies
}))
return(result)
}
# Tidy the sentence sentiment out of the returned core-nlp object, output in tidy-data format
coreNLP_sentiment_parser <- function(coreNLP_objects){
result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
original_data <- obj$data
doc <- obj$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)!=0) # newly added guard against empty results
sen <- sentences[[1]]
# extract the sentiment: sentiment
sentiment <- original_data %>%
t() %>%
data.frame() %>%
bind_cols(data.frame(sentiment=sen$sentiment, sentimentValue=sen$sentimentValue))
sentiment
}))
return(result)
}

Visualization function
- parse2tree: visualizes the parse result as a tree diagram
# Display the parse/dependency result graphically
parse2tree <- function(ptext) {
stopifnot(require(NLP) && require(igraph))
# this step modifies coreNLP parse tree to mimic openNLP parse tree
ptext <- gsub("[\r\n]", "", ptext)
ptext <- gsub("ROOT", "TOP", ptext)
## Replace words with unique versions
ms <- gregexpr("[^() ]+", ptext) # just ignoring spaces and brackets?
words <- regmatches(ptext, ms)[[1]] # just words
regmatches(ptext, ms) <- list(paste0(words, seq.int(length(words)))) # add id to words
## Going to construct an edgelist and pass that to igraph
## allocate here since we know the size (number of nodes - 1) and -1 more to exclude 'TOP'
edgelist <- matrix('', nrow=length(words)-2, ncol=2)
## Function to fill in edgelist in place
edgemaker <- (function() {
i <- 0 # row counter
g <- function(node) { # the recursive function
if (inherits(node, "Tree")) { # only recurse subtrees
if ((val <- node$value) != 'TOP1') { # skip 'TOP' node (added '1' above)
for (child in node$children) {
childval <- if(inherits(child, "Tree")) child$value else child
i <<- i+1
edgelist[i,1:2] <<- c(val, childval)
}
}
invisible(lapply(node$children, g))
}
}
})()
## Create the edgelist from the parse tree
edgemaker(Tree_parse(ptext))
tree <- FromDataFrameNetwork(as.data.frame(edgelist))
return (tree)
}

## # A tibble: 6 x 2
## status_id text
## <chr> <chr>
## 1 1398193870828048… Come here beb
## 2 1398193720130891… Ethereum Price Prediction ETH bids goodbye to 3000 as bulli…
## 3 1397439017344331… Goldman Sachs Ethereum ETH Might Overtake Bitcoin BTC As A …
## 4 1397514509137301… Crypto Market Tanks 14 to 3Month Low Under 1.35 Trillion Et…
## 5 1398193356619014… Ha unu da belirteyimveyaay oldu mu diyenler var. Ay piyasas…
## 6 1398193207457062… Awesome project
Running CoreNLP

Get the objects returned by coreNLP. With a lot of data this takes a long time, so it is best to run it once and save the result.
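The run itself is not shown in this document; a minimal sketch, assuming the server from the previous section is listening at host and that the returned objects are stored in aa (the .Rdata file name here is hypothetical):

# aa <- coreNLP(BD_df, host) # slow for ~30,000 tweets, so run it only once
# save(aa, file = "coreNLP_result.Rdata") # hypothetical file name
# load("coreNLP_result.Rdata")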
Extracting the results

(1). Tokenization, lemmatization, POS tagging, NER
Load the previously saved coreNLP results.
Remove the empty entries (they cannot be parsed by the functions defined above).
Run a small test first:
# Debug
# finally removed the entries where length(sentences)==0
n=c()
for(i in 1:length(test)){
original_data <- test[[i]]$data
doc <- test[[i]]$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)==0) next # next skips the iteration; inside the parser functions a simple if-check is enough
# sen <- sentences[[1]]
print(sprintf('number is %i', i))
}

## [1] "number is 1"
## [1] "number is 2"
## [1] "number is 3"
## [1] "number is 4"
## [1] "number is 5"
## [1] "number is 6"
## [1] "number is 7"
## [1] "number is 8"
## [1] "number is 9"
## [1] "number is 10"
## [1] "number is 12"
## [1] "number is 13"
## [1] "number is 14"
## [1] "number is 15"
## [1] "number is 16"
## [1] "number is 17"
## [1] "number is 18"
## [1] "number is 19"
## [1] "number is 20"
## [1] "number is 21"
# entry 11, which has no value, was skipped, as intended
# [1] "number is 1"
# [1] "number is 2"
# [1] "number is 3"
# [1] "number is 4"
# [1] "number is 5"
# [1] "number is 6"
# [1] "number is 7"
# [1] "number is 8"
# [1] "number is 9"
# [1] "number is 10"
# [1] "number is 12"把test中的sentens是空值的rows刪掉了 长度从21变为了20,我们成功将空值删除了
## [1] 21
n=c()
for(i in 1:length(test)){
original_data <- test[[i]]$data
doc <- test[[i]]$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)==0)
n=c(n,i) # collect the indices of the empty entries
next
}
test=test[-n]
length(test)

## [1] 20
## status_id word lemma pos ner
## 1 1397129949706915850 So so RB O
## 2 1397129949706915850 who who WP O
## 3 1397129949706915850 is be VBZ O
## 4 1397129949706915850 ready ready JJ O
## 5 1397129949706915850 to to TO O
## 6 1397129949706915850 make make VB O
# now it is fully cleaned up
# find the indices of every "y", then delete them (the same idea is used below to delete all empty entries)
ddata = list("z","dd","y","ff","s","5","y","u")
n=c()
for(i in 1:length(ddata)){
if(ddata[[i]] == "y")
n=c(n,i)
next
}
n

## [1] 3 7
## [[1]]
## [1] "z"
##
## [[2]]
## [1] "dd"
##
## [[3]]
## [1] "ff"
##
## [[4]]
## [1] "s"
##
## [[5]]
## [1] "5"
##
## [[6]]
## [1] "u"
Do the same thing for the full data set aa; its length before removal is:
## [1] 30770
n=c()
for(i in 1:length(aa)){
original_data <- aa[[i]]$data
doc <- aa[[i]]$doc
# for a sentences
sentences <- doc$sentences
if(length(sentences)==0)
n=c(n,i) # this is the R equivalent of append in Python
next
}
aa=aa[-n]
length(aa)

## [1] 30697
coreNLP_tokens_parser columns:
- status_id: the status_id from the original df, the unique ID of a tweet
- word: the original token
- lemma: the lemma (base form) of the token
- pos: part-of-speech tag
- ner: named-entity tag
# This step also takes a while, so the tokens were saved to BDtokens.csv
# tokens = coreNLP_tokens_parser(aa)
# head(tokens,20)
# write.csv(tokens,"BDtokens.csv", row.names = FALSE)

Read the previously parsed tokens back in directly.
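Presumably something like the following, which would produce the preview below (read.csv turns status_id into a numeric, which is why it prints as 1.398194e+18):

tokens <- read.csv("BDtokens.csv")
head(tokens, 20)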
## status_id word lemma pos ner
## 1 1.398194e+18 Come come VB O
## 2 1.398194e+18 here here RB O
## 3 1.398194e+18 beb beb NNP O
## 4 1.398194e+18 Ethereum Ethereum NNP O
## 5 1.398194e+18 Price Price NNP O
## 6 1.398194e+18 Prediction Prediction NNP O
## 7 1.398194e+18 ETH ETH NNP O
## 8 1.398194e+18 bids bid NNS O
## 9 1.398194e+18 goodbye goodbye NN O
## 10 1.398194e+18 to to TO O
## 11 1.398194e+18 3000 3000 CD NUMBER
## 12 1.398194e+18 as as IN O
## 13 1.398194e+18 bullish bullish JJ O
## 14 1.398194e+18 momentum momentum NN O
## 15 1.398194e+18 fades fade VBZ O
## 16 1.398194e+18 Ethereum Ethereum NNP O
## 17 1.398194e+18 gracefully gracefully RB O
## 18 1.398194e+18 rose rise VBD O
## 19 1.398194e+18 from from IN O
## 20 1.398194e+18 the the DT O
NER analysis

With named-entity recognition we can see which countries, organizations, and so on appear in the discussion.
## [1] O NUMBER DATE ORGANIZATION
## [5] SET MISC NATIONALITY CAUSE_OF_DEATH
## [9] STATE_OR_PROVINCE TITLE LOCATION PERSON
## [13] CITY DURATION MONEY COUNTRY
## [17] ORDINAL CRIMINAL_CHARGE IDEOLOGY RELIGION
## [21] TIME PERCENT URL
## 23 Levels: CAUSE_OF_DEATH CITY COUNTRY CRIMINAL_CHARGE DATE ... URL
## [1] 16140
Lowercasing

Capitalization also affects CoreNLP's NER decisions, so the tweet text was fed in without case normalization. After running the annotators, we create two new columns, lower_word and lower_lemma, holding the lowercased word and lemma, so that different capitalizations of the same word (e.g. Evergiven vs. evergiven) are counted together when computing word frequencies.
tokens$lower_word = tolower(tokens$word)
tokens$lower_lemma = tolower(tokens$lemma)
unique(tokens$ner)

## [1] O NUMBER DATE ORGANIZATION
## [5] SET MISC NATIONALITY CAUSE_OF_DEATH
## [9] STATE_OR_PROVINCE TITLE LOCATION PERSON
## [13] CITY DURATION MONEY COUNTRY
## [17] ORDINAL CRIMINAL_CHARGE IDEOLOGY RELIGION
## [21] TIME PERCENT URL
## 23 Levels: CAUSE_OF_DEATH CITY COUNTRY CRIMINAL_CHARGE DATE ... URL
Country analysis

The most-discussed countries are still the United States and China.
tokens %>%
filter(ner == "COUNTRY") %>% # keep tokens whose NER tag is COUNTRY
group_by(lower_word) %>% # group by word
summarize(count = n()) %>% # count each group
top_n(n = 13, count) %>%
ungroup() %>%
mutate(word = reorder(lower_word, count)) %>%
ggplot(aes(word, count)) +
geom_col()+
ggtitle("Word Frequency (NER is COUNTRY)") +
theme(text=element_text(size=14))+
coord_flip()

Organizations mentioned (ORGANIZATION)
Cardano (ADA) is a public blockchain platform. Bitcoin (BTC): inflation hedge, store of value, digital gold. Ether (ETH): the infrastructure of the crypto world, its water, electricity, and gas. Exchange tokens such as Binance Coin (BNB): comparable to holding exchange stock. Stablecoins such as USDT: payments.
tokens %>%
filter(ner == "ORGANIZATION") %>% # keep tokens whose NER tag is ORGANIZATION
group_by(lower_word) %>% # group by word
summarize(count = n()) %>% # count each group
top_n(n = 10, count) %>%
ungroup() %>%
mutate(word = reorder(lower_word, count)) %>%
ggplot(aes(word, count)) +
geom_col()+
ggtitle("Word Frequency (NER is ORGANIZATION)") +
theme(text=element_text(size=14))+
coord_flip()

People mentioned (PERSON)
Charles Hoskinson is a co-founder of BitShares and a co-founder of Cardano and Ethereum, both of which are blockchain platforms. Polygon (MATIC) is an Ethereum layer-2 scaling solution.
tokens %>%
filter(ner == "PERSON") %>% # keep tokens whose NER tag is PERSON
group_by(lower_word) %>% # group by word
summarize(count = n()) %>% # count each group
top_n(n = 10, count) %>%
ungroup() %>%
mutate(word = reorder(lower_word, count)) %>%
ggplot(aes(word, count)) +
geom_col()+
ggtitle("Word Frequency (NER is PERSON)") +
theme(text=element_text(size=14))+
coord_flip()

Sentiment scores
Examining the sentiment of the tweets
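The table below was presumably produced with coreNLP_sentiment_parser; a minimal sketch, assuming aa holds the coreNLP objects and keeping only the id, text, and sentiment columns for display:

# sentiment = coreNLP_sentiment_parser(aa) # slow, so run once and reuse the result
head(sentiment[, c("status_id", "text", "sentiment", "sentimentValue")], 20)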
## status_id
## 1 1398193870828048387
## 2 1398193720130891776
## 3 1397439017344331784
## 4 1397514509137301508
## 5 1398193356619014145
## 6 1398193207457062913
## 7 1396902299117752325
## 8 1397935607356272646
## 9 1398193017706717185
## 10 1397978301428289547
## 11 1397828282368811008
## 12 1398183614198714368
## 13 1398187296713842691
## 14 1397943404424794117
## 15 1397833729914163200
## 16 1397949637932236800
## 17 1397869800244850690
## 18 1397754842190270464
## 19 1397671741309128706
## 20 1396988433831055362
## text
## 1 Come here beb
## 2 Ethereum Price Prediction ETH bids goodbye to 3000 as bullish momentum fades Ethereum gracefully rose from the support formed at 1750 to trade near 2800. The recovery was not unique to the
## 3 Goldman Sachs Ethereum ETH Might Overtake Bitcoin BTC As A Store of Value Amid the recent crypto market volatility and the rising dominance of Ethereum ETH the debate of ETH vs BTC has
## 4 Crypto Market Tanks 14 to 3Month Low Under 1.35 Trillion Ethereum ETH Under 2000 On Sunday May 23 the cryptocurrency market has extended its weekly losses further tanking another 14 and
## 5 Ha unu da belirteyimveyaay oldu mu diyenler var. Ay piyasasn aha ayya girdik demekle olmaz ki. Grafikte ay piyasasna girdin diyen indikatr veya teknik yok. nemli olan u birok alt coin. Ocak ubat fiyatna geldi. Bitcoin de ocak fiyatnda
## 6 Awesome project
## 7 EK olarak EKL ile belirlenecek 5 kiiye 2er adet CAKE hediye edeceiz..
## 8 trx Bitcoin ve ETH baklrsa fiyat yl geneline gre halen daha uygun saylr olas bir yn hareketi belirterek destek ve direnleri izdim. Terste kalanlarn sabrla beklemesi gerekir en azndan salkl bir projede yatrmnz var.
## 9 xrp dn direnleri bugn destekleri konualm en yakn destek 84 seviyesi 2. destek 75 seviyesi.. Bugndzeltmesinin uzun soluklu olacan dnmyorum xrp ve eth nin 23 haftas daha var diye tahmin ediyorum..
## 10 Bakn bence Elon Musk 1 ise Bu adam kesinlikle maniplasyon konusunda 10'dur. Bana att tweetler hi bir ekilde samimi gelmiyor.. Bu adam HODL diyorsa ben endie duyuyorum gerekten.. En byk maniplasyonu bunlar yapyor
## 11 Piyasann yeilleri yakmas an meselesincesi minik bir dzeltme ardndan hafta sonu art depremleri ve nmzde gl bir ykseli dalgas nceki krlm noktalarndan coinler toplanabilir.. Analizler iin..
## 12 Hafta sonuna girerken olabilecek dzeltmeler den korkmamal ama temkinli olmak lazm stop kullanmay ihmal etmeyin. Ayrca aa dip grdnz destek noktalarna alm oltalar girebilirsiniz.
## 13 Bugn tekrar herkes elinde terste kalan coinleri yazsn tek tek bakalm gn ierisinde grafikler ne durumdaizin verdii srece grafikler zerinde yorum yaparz..
## 14 Hangi coin veya coinlerden terste kaldnz? Bu yl en ok hangi coinden kazandnz? Kar damsnz? Zarar damsnz? Yakn bir sre ierisinde geri dnm iin gzel bir alma balatacaz lk ders kaybetmemek.
## 15 crv Dn sorulan coinlerden biri alan kiiler maliyetlerini belirtebilirler yorumu ona gre yapabilirim. ATH seviyesinden ok geride TOBO oluumu bir ok coinde gzlemlenebilir nmzdeki hafta. Daha fazla Analiz iin..
## 16 pnt ilk ekleme noktas 0.87 seviyesi kademeli bir ekilde alma uygun cake eer ykseliini srdrmeye devam ederse pnt fiyat ok ucuz kalacak.. kademeli hedefler belirtilmitir ''Pump coini'' olduu aikar.
## 17 Hangi hatay yaptn tam olarak zemiyorsan ayn ekilde piyasa ile alakal i skntn telegram grubumuzda da dile getirebilirsiniz. Her eyden nce Psikolojinizi korumak zorundasnz.
## 18 Buy BTC With Your Credit Card Fast Crypto Exchange
## 19 Buy BTC With Your Credit Card Fast Crypto Exchange
## 20 Buy Crypto With Your Credit Card Fast Crypto Exchange
## sentiment sentimentValue
## 1 Neutral 2
## 2 Positive 3
## 3 Positive 3
## 4 Negative 1
## 5 Positive 3
## 6 Verypositive 4
## 7 Neutral 2
## 8 Neutral 2
## 9 Positive 3
## 10 Neutral 2
## 11 Neutral 2
## 12 Neutral 2
## 13 Neutral 2
## 14 Neutral 2
## 15 Neutral 2
## 16 Neutral 2
## 17 Neutral 2
## 18 Neutral 2
## 19 Neutral 2
## 20 Neutral 2
## [1] Neutral Positive Negative Verypositive Verynegative
## Levels: Neutral Positive Negative Verypositive Verynegative
sentiment$sentimentValue <- sentiment$sentimentValue %>%
as.numeric
# look at the distribution of sentiment labels
sentiment$sentiment %>%
table()

## .
## Neutral Positive Negative Verypositive Verynegative
## 20875 5497 4119 195 11
# this is the only place where the time format is actually converted
BD_df$date = as.Date(BD_df$created_at)
sentiment %>%
merge(BD_df[,c("status_id","source","date")]) %>%
group_by(date) %>%
summarise(avg_sentiment = mean(sentimentValue,na.rm=T)) %>%
ggplot(aes(x=date,y=avg_sentiment)) +
geom_line()

# the time here is still only year-month-day; I want minute-level data
# here the minute-level data is selected
senti_df <- sentiment%>%
merge(BD_date)
senti_df$created_at = as.Date(senti_df$created_at)
head(senti_df)

## status_id
## 1 1395375895532146699
## 2 1395376077246238724
## 3 1395376411603443720
## 4 1395376681573965825
## 5 1395376684514357251
## 6 1395376687014109185
## text
## 1 If Shibu ever challenges Doge in the future it may be the battle of the century between cryptos.Crypto semantics make this a plausible reality in the near future.
## 2 hbar btc eth xrp vet ada doge
## 3 What Is Internet Computer Protocol ICP?
## 4 Buy BTC With Your Credit Card Fast Crypto Exchange
## 5 These ExJournalists Are Using AI to Catch Online Defamation
## 6 The Long Journey of Usoni an African Postapocalyptic Game
## sentiment sentimentValue created_at source Dates Hours
## 1 Negative 3 2021-05-20 Twitter for iPhone 2021-05-20 13
## 2 Neutral 1 2021-05-20 Twitter Web App 2021-05-20 13
## 3 Neutral 1 2021-05-20 Twitter Web App 2021-05-20 13
## 4 Neutral 1 2021-05-20 CoinTweety 2021-05-20 13
## 5 Neutral 1 2021-05-20 IFTTT 2021-05-20 13
## 6 Neutral 1 2021-05-20 IFTTT 2021-05-20 13
## Minutes
## 1 48
## 2 49
## 3 50
## 4 51
## 5 51
## 6 51
First look at how sentiment changes over time (it is a bit too dense).
Because the minute level is too fine-grained, we switch to hourly aggregation.
hhhs <- senti_df %>%
group_by(Dates, Hours) %>%
dplyr::summarise(avg_sentiment = mean(sentimentValue,na.rm=T))

## `summarise()` has grouped output by 'Dates'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: Dates [1]
## Dates Hours avg_sentiment
## <chr> <chr> <dbl>
## 1 2021-05-20 13 1.51
## 2 2021-05-20 14 1.54
## 3 2021-05-20 15 1.55
## 4 2021-05-20 16 1.53
## 5 2021-05-20 17 1.68
## 6 2021-05-20 18 1.53
# combine the date and hour into one time string to make plotting easier
hhhs$time <- paste(hhhs$Dates, hhhs$Hours)
# hhhs$time <- as.POSIXct(paste(hhhs$time, hhhs$Minutes), format="%Y-%m-%d %H:%M")
# hhhs$time <- as.POSIXct(hhhs$time, format="%Y-%m-%d %H:%M:%S")
hhhs$times <- as.POSIXct(hhhs$time,format="%Y-%m-%d %H")
head(hhhs)

## # A tibble: 6 x 5
## # Groups: Dates [1]
## Dates Hours avg_sentiment time times
## <chr> <chr> <dbl> <chr> <dttm>
## 1 2021-05-20 13 1.51 2021-05-20 13 2021-05-20 13:00:00
## 2 2021-05-20 14 1.54 2021-05-20 14 2021-05-20 14:00:00
## 3 2021-05-20 15 1.55 2021-05-20 15 2021-05-20 15:00:00
## 4 2021-05-20 16 1.53 2021-05-20 16 2021-05-20 16:00:00
## 5 2021-05-20 17 1.68 2021-05-20 17 2021-05-20 17:00:00
## 6 2021-05-20 18 1.53 2021-05-20 18 2021-05-20 18:00:00
Sentiment scores range from 0 (lowest) to 4 (highest):
- 0, 1: very negative, negative
- 2: neutral
- 3, 4: positive, very positive
# merge the bitcoin price with the hourly sentiment into one table (both are standardized below so they can be plotted together)
mergedata <- open_data %>%
merge(hhhs, by.x="Date", by.y="times")
# ggplot() +
# theme_minimal() +
# theme(plot.title = element_text(face = "bold")) +
# geom_line(aes(x=Date,y=avg_sentiment),col = '#ffa500') +
mergedata$scaleOpen <- scale(mergedata$Open,center = T,scale = T)
mergedata$scaleAS <- scale(mergedata$avg_sentiment,center = T,scale = T)
head(mergedata)

## Date Open Dates Hours avg_sentiment time
## 1 2021-05-20 14:00:00 41835.51 2021-05-20 14 1.538462 2021-05-20 14
## 2 2021-05-20 15:00:00 41547.04 2021-05-20 15 1.551724 2021-05-20 15
## 3 2021-05-20 16:00:00 41526.04 2021-05-20 16 1.534188 2021-05-20 16
## 4 2021-05-20 17:00:00 39383.56 2021-05-20 17 1.677273 2021-05-20 17
## 5 2021-05-20 18:00:00 39893.33 2021-05-20 18 1.527950 2021-05-20 18
## 6 2021-05-20 19:00:00 39920.72 2021-05-20 19 1.439759 2021-05-20 19
## scaleOpen scaleAS
## 1 1.9934591 0.6232510
## 2 1.8431808 0.7405717
## 3 1.8322409 0.5854476
## 4 0.7161177 1.8511723
## 5 0.9816819 0.5302689
## 6 0.9959507 -0.2498696
ggplot(data = mergedata)+
geom_line(aes(x=Date,y=scaleOpen))+
geom_line(aes(x=Date,y=scaleAS),linetype="dotted")
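The second hypothesis asks whether changes in sentiment and changes in price are correlated. The document stops at the visual comparison above; a quick numeric check on the merged hourly table could look like this (a sketch, not part of the original analysis):

cor(mergedata$Open, mergedata$avg_sentiment, use = "complete.obs") # Pearson correlation between hourly open price and average sentiment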
Topic analysis of the discussion

LDA topic modeling
- Based on word frequency, keep only the words that appear more than 3 times
- After arranging the data into an id / word / count format, it can be cast to a DTM
P.S. After the group_by the original word structure is lost, so the word frequencies are stored separately in reserved_word.
## status_id word lemma pos ner lower_word lower_lemma
## 1 1.398194e+18 Come come VB O come come
## 2 1.398194e+18 here here RB O here here
## 3 1.398194e+18 beb beb NNP O beb beb
## 4 1.398194e+18 Ethereum Ethereum NNP O ethereum ethereum
## 5 1.398194e+18 Price Price NNP O price price
## 6 1.398194e+18 Prediction Prediction NNP O prediction prediction
# stop_words <- scan(file = "./dict/stopwords.txt", what=character(),sep='\n',
# encoding='utf-8',fileEncoding='utf-8')
# head(stop_words)
# the default stop words bundled with tidytext can be loaded directly
data("stop_words")
head(stop_words)

## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
# reshape the tokens into the format we need
new_tokens <- tokens %>%
select(status_id, lower_word) %>%
anti_join(stop_words, by = c("lower_word" = "word")) %>% # drop stop words
filter((!str_detect(lower_word, regex("[0-9]")))) %>%
count(status_id, lower_word) %>%
rename(count=n)
head(new_tokens)

## status_id lower_word count
## 1 1.395376e+18 battle 1
## 2 1.395376e+18 century 1
## 3 1.395376e+18 challenges 1
## 4 1.395376e+18 cryptos.crypto 1
## 5 1.395376e+18 doge 1
## 6 1.395376e+18 future 1
library(tm)
library(topicmodels)
library(purrr)
library(reshape2)
library(tidyr)
# Build graph data
library(tidygraph)
library(igraph)
library(ggraph)

Converting the data to a Document-Term Matrix (DTM)
freq = 3
# keep words whose frequency exceeds freq
reserved_word <- new_tokens %>%
group_by(lower_word) %>%
count() %>%
filter(n > freq) %>%
unlist()
mask_removed <- new_tokens %>%
filter(lower_word %in% reserved_word)
# in mask_dtm, nrow = number of documents; ncol = number of terms
mask_dtm <- mask_removed %>%
cast_dtm(status_id, lower_word, count)
inspect(mask_dtm[1:10,1:10])

## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 17/83
## Sparsity : 83%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs ada battle btc doge eth future future. hbar reality vet
## 1395375895532146688 0 1 0 1 0 1 1 0 1 0
## 1395376077246238720 1 0 1 1 1 0 0 1 0 1
## 1395376411603443712 0 0 0 0 0 0 0 0 0 0
## 1395376681573965824 0 0 1 0 0 0 0 0 0 0
## 1395376684514357248 0 0 0 0 0 0 0 0 0 0
## 1395376687014109184 0 0 0 0 0 0 0 0 0 0
## 1395376690839408640 0 0 0 0 0 0 0 0 0 0
## 1395376809823412224 0 0 1 0 0 0 0 0 0 0
## 1395376818476171264 0 0 0 0 0 0 0 0 0 0
## 1395376894783205376 1 0 1 1 1 0 0 0 0 0
Building the LDA model
# drop the rows of the matrix that are all zeros
raw.sum <- apply(mask_dtm,1,FUN=sum) # sum each row of the matrix
mask_dtm <- mask_dtm[raw.sum!=0,]
# build the model
# mask_lda <- LDA(mask_dtm, k = 4, control = list(seed = 123))

Finding the best number of topics via perplexity
ldas = c()
topics = c(2,4,6,8)
for(topic in topics){
start_time <- Sys.time()
lda <- LDA(mask_dtm, k = topic, control = list(seed = 2021))
ldas <- c(ldas,lda)
print(paste(topic ,paste("topic(s) and use time is ", Sys.time() -start_time)))
save(ldas,file = "ldas_result.rdata") # save the fitted models to a file
}

## [1] "2 topic(s) and use time is 0.930460929870605"
## [1] "4 topic(s) and use time is 24.2873389720917"
## [1] "6 topic(s) and use time is 45.7744100093842"
## [1] "8 topic(s) and use time is 56.1082332134247"
topics = c(2,4,6,8)
data_frame(k = topics, perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
ggplot(aes(k, perplex)) +
geom_point() +
geom_line() +
labs(title = "Evaluating LDA topic models",
subtitle = "Optimal number of topics (smaller is better)",
x = "Number of topics",
y = "Perplexity")## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
Based on the plot above, 4 topics are chosen.
Extracting the representative terms
# build the model
mask_lda <- LDA(mask_dtm, k = 4, control = list(seed = 123))
removed_word = c("bitcoin","eth")
# look at the common words in each topic
tidy(mask_lda, matrix = "beta") %>% # extract the topic-term beta values
filter(! term %in% removed_word) %>%
group_by(topic) %>%
top_n(10, beta) %>% # top 10 words by beta in each topic
ungroup() %>%
mutate(topic = as.factor(topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()

topic 1 = " " topic 2 = " " topic 3 = " " topic 4 = " "
Exploring the content of each topic

# use matrix="gamma" in the tidy function to get the theta (document-topic) matrix
mask_topics <- tidy(mask_lda, matrix="gamma") %>% # document topic gamma
group_by(document) %>%
top_n(1, wt=gamma)
mask_topics

## # A tibble: 29,745 x 3
## # Groups: document [29,689]
## document topic gamma
## <chr> <int> <dbl>
## 1 1395376809823412224 1 0.875
## 2 1395377181518401536 1 0.820
## 3 1395377590454702080 1 0.587
## 4 1395377780527796224 1 0.985
## 5 1395378767145676800 1 0.977
## 6 1395378816768434176 1 0.972
## 7 1395379083815489536 1 0.985
## 8 1395381293991464960 1 0.844
## 9 1395382161499475968 1 0.898
## 10 1395382987567554560 1 0.989
## # … with 29,735 more rows
posts_topic <- merge(x = BD_df, y = mask_topics, by.x = "status_id", by.y="document")
# see what each topic is talking about
set.seed(123)
posts_topic %>% # topic 2
filter(topic==2) %>%
select(text) %>%
unique() %>%
sample_n(5)

## text
## 1 We fly to mars!!!eth With infinite burninggains value every dayeth in actioneth
## 2 Come learnonwith aor! How to Buy Cryptocurrency on Bitrue With Debit or Credit Card
## 3 Speklatif hareketlerparalarn gelecein teknolojisi olmasn vein gvenilirliini asla deitirmeyecektir.gelecein teknolojisinde ba ekmektedir.nun da dedii gibi ''.
## 4 Old Masters Art Auguste Renoir Georges Rivire 1877. Buy at
## 5 Grnm alalan kama 4h periyotta RSI da pozitif uyumsuzluk var ve 384 e kadar ykseli yapabilir.nemli olan Kama nn dna trendi krp kabilmesidir. 384 de kp kramazsa kama desteine yeniden iner destekdeki izlenir bnb
Topic distribution by date
topics_name = c("1","2","3","4")
posts_topic %>%
mutate(artDate = as.Date(created_at)) %>%
group_by(artDate,topic) %>%
summarise(sum =sum(topic)) %>%
ggplot(aes(x= artDate,y=sum,fill=as.factor(topic))) +
geom_col()

## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
posts_topic %>%
mutate(artDate = as.Date(created_at)) %>%
group_by(artDate,topic) %>%
summarise(sum =sum(topic)) %>%
ggplot(aes(x= artDate,y=sum,fill=as.factor(topic))) +
geom_col(position="fill") ## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
Removing the ad posts and re-running the analysis

A large share of the tweets are ads, and we want to filter them out. Normal users rarely tweet repeatedly, so we drop the accounts that post many times before doing the network analysis (a sketch of the filtering step follows the chunk below).
# First look at the distribution of tweets per user: most accounts post only 1-2 tweets, so those with more than 2 are dropped
df_ads <- BD_df %>%
group_by(user_id) %>%
summarize(count = n()) %>%
group_by(count)%>%
filter(count<10)%>%
summarize(numbers = n())%>%
ggplot(aes(x=count, y=numbers)) +
geom_point() +
geom_line()
df_ads
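A minimal sketch of the filtering step itself, assuming (per the comment above) that accounts with more than 2 tweets in the window are treated as likely ads and dropped:

heavy_posters <- BD_df %>%
group_by(user_id) %>%
summarize(count = n()) %>%
filter(count > 2) %>%
pull(user_id) # user_ids with more than 2 tweets
BD_noads <- BD_df %>% filter(!user_id %in% heavy_posters) # keep tweets from the remaining accounts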