This is a continuation from previous attempt to classify Bitcoin using logistic regression and KNN. The purpose is still the same i.e. identifying the factors that best predict the likelihood of Bitcoin price closing at a price higher vs the previous week, and this time, applying Random Forest classifier.
Few improvements will be included here: - It has been established through correlation matrix presented in the previous LBB that only VIX, broad market indices, gold price, and lagged trading volume has high correlation to Bitcoin price movements. Here, we will drop the other explanatory variables with relatively lower correlation to Bitcoin price, and this include business cycle variables such as treasury rates, Dollar Index, and halving hype / effects - Technical terms such as Bitcoin’s own EMA50 and MACD will be included - Timeframe will be shortened but frequency is daily instead of weekly
New variables
According to existing literature, gold and oil price proxies are both important explanatory variables for Bitcoin’s price behavior and they work very well with random forest classifier (up to 80% accuracy for short-term predictions). The reasoning is that Bitcoin has been seen by many investors and traders as digital substitute to gold which has held the safe haven role for many decades, and as such has been used to diversify against oil price’s volatility. For volatility proxy, previously VIX was used in the logit model. Now OVX, which is expected 30 days volatility for US crude oil will be used.
Technical indicators that will be used are EMA50 and MACD
library(yahoofinancer)
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(dplyr)
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:xts':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(TTR)
library(scales)
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(corrplot)
## corrplot 0.92 loaded
library(readr)
##
## Attaching package: 'readr'
## The following object is masked from 'package:scales':
##
## col_factor
library(class)
library(ggcorrplot)
library(caret)
## Loading required package: lattice
library(inspectdf)
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
#Set up the start and end date
start_date <- as.Date("2019-11-01")
end_date <- as.Date("2023-10-31")
#load the data from API
symbols <- c("BTC-USD", "^OVX", "CL=F", "GC=F")
data_frames_list <- list()
for (symbol in symbols) {
data <- as.data.frame(getSymbols(symbol, src = "yahoo", from = start_date, to = end_date, auto.assign = FALSE))
data_frames_list[[symbol]] <- data
}
#Rename list elements with the synbols
names(data_frames_list) <- symbols
#Load each symbol into separate dataframe
btc<- data_frames_list$`BTC-USD` %>%
select(`BTC-USD.Close`)
ovx<- data_frames_list$`^OVX` %>%
select(OVX.Close)
oil<- data_frames_list$`CL=F` %>%
select(`CL=F.Close`)
gold <- data_frames_list$`GC=F` %>%
select(`GC=F.Close`)
#change the index to Date
btc$Date <- rownames(btc)
rownames(btc) <- NULL
btc<- btc %>%
select(Date, everything())
ovx$Date <- rownames(ovx)
rownames(ovx) <- NULL
ovx<- ovx %>%
select(Date, everything())
oil$Date <- rownames(oil)
rownames(oil) <- NULL
oil<- oil %>%
select(Date, everything())
gold$Date <- rownames(gold)
rownames(gold) <- NULL
gold<- gold %>%
select(Date, everything())
#combine all to one df
data_frames_list <- list(btc$`BTC-USD.Close`, ovx$OVX.Close, oil$`CL=F.Close`, gold$`GC=F.Close`)
bitcoin_date <- btc$Date
final_df <- data.frame(Date = bitcoin_date, do.call(cbind, data_frames_list))
## Warning in (function (..., deparse.level = 1) : number of rows of result is not
## a multiple of vector length (arg 2)
new_column_names <- c("Date", "BTC", "OVX", "OIL", "GOLD")
colnames(final_df) <- new_column_names
#check if any empty cells
any(is.na(final_df))
## [1] FALSE
#Glimpse
glimpse(final_df)
## Rows: 1,461
## Columns: 5
## $ Date <chr> "2019-11-01", "2019-11-02", "2019-11-03", "2019-11-04", "2019-11-…
## $ BTC <dbl> 9261.104, 9324.718, 9235.354, 9412.612, 9342.527, 9360.880, 9267.…
## $ OVX <dbl> 34.37, 35.63, 33.77, 33.22, 31.45, 31.45, 34.22, 33.74, 35.15, 33…
## $ OIL <dbl> 56.20, 56.54, 57.23, 56.35, 57.15, 57.24, 56.86, 56.80, 57.12, 56…
## $ GOLD <dbl> 1508.0, 1508.0, 1480.8, 1490.2, 1464.2, 1461.3, 1455.5, 1452.1, 1…
Create Bitcoin technical indicators
Specifically the two that will be added as explanatory variables: - EMA50: Choice of 50 is due to empahsis on shorter timeframe - MACD: Standard MACD assumption of fast length = 12 days, slow length = 26 days, signal length = 9 days. Note for inclusion as explanatory variable we will use MACD histogram which is simply the difference between MACD and MACD Signal. In general positive value means bullish momentum, and negative values means otherwise
#Calculate EMA50
final_df$EMA50 <- EMA(final_df$BTC, n=50)
#Calculate MACD histogram
MACD <- MACD(final_df$BTC, nFast = 12, nSlow = 26, nSig = 9, maType = "EMA")
macd_histogram <- MACD[, "macd"] - MACD[, "signal"]
final_df$MACDhistogram <- macd_histogram
Data Wrangling
Next, calculate BTC price returns and remove all rows with empty values
# Define a function to calculate returns
calculate_returns <- function(column) {
column <- as.numeric(column)
return((column / lag(column, default = first(column))) - 1)
}
# Apply the function to BTC column
final_df$BTC_ret<- calculate_returns(final_df$BTC)
#drop all rows with empty values
final_df <- na.omit(final_df)
#turn BTC into target variable class
final_df$BTC_ret <- ifelse(final_df$BTC_ret <=0, 0, 1)
#check target class proportion
prop.table(table(final_df$BTC_ret))
##
## 0 1
## 0.4936261 0.5063739
#Remove BTC price data as we are going to use BTC_ret (levels 0 and 1) as target variable
final_df <- final_df %>%
select(-BTC)
#Ensure data types are appropriate
final_df$Date <- ymd(final_df$Date)
final_df <- final_df %>%
mutate(BTC_ret = as.factor(BTC_ret))
rownames(final_df) <- final_df$Date
final_df$Date <- NULL
Introduction to Variables in the dataset
Cross Validation
80% of the datasets will be used for training while the remaining 20% will be used for testing. After splitting, we would also be running a check to see if target variable class proportion remains somewhat balanced.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(x = nrow(final_df), size = nrow(final_df)*0.8)
#Splitting
model_train <- final_df[index,]
model_test <- final_df[-index,]
#Check prop table for target variable in the train dataset
prop.table(table(model_train$BTC_ret))
##
## 0 1
## 0.5031001 0.4968999
RF model
model_rf <- randomForest(BTC_ret~., data= model_train, ntree=1100, verbose=T)
model_test$Pred_rf <- predict(model_rf, model_test)
As we can see from the result below, the result is not good as it’s even lower than chance.
confusionMatrix(data=as.factor(model_test$Pred_rf), reference=model_test$BTC_ret, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 57 90
## 1 72 64
##
## Accuracy : 0.4276
## 95% CI : (0.3692, 0.4875)
## No Information Rate : 0.5442
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : -0.141
##
## Mcnemar's Test P-Value : 0.1817
##
## Sensitivity : 0.4156
## Specificity : 0.4419
## Pos Pred Value : 0.4706
## Neg Pred Value : 0.3878
## Prevalence : 0.5442
## Detection Rate : 0.2261
## Detection Prevalence : 0.4806
## Balanced Accuracy : 0.4287
##
## 'Positive' Class : 1
##
varImpPlot(model_rf)
This exercise will be to create a sentiment labeling process for a series of tweets related to Bitcoin and other cryptocurrencies.
This would ideally be very useful if we can further connect the quantities of positive sentiment per day with changes in Bitcoin price to uncover any kind of relationship, however this would not be possible due to data availability and computational power limitations. As a picture, there could be 100K tweets related to Bitcoin in a single day alone.
library(dplyr)
library(stringr)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(SnowballC)
btc_tweets<- read.csv("archive/Bitcoin_tweets_dataset_2.csv")
glimpse(btc_tweets)
## Rows: 174,397
## Columns: 13
## $ user_name <chr> "ChefSam", "Roy⚡️", "Ethereum Yoda", "Viction", "Rosi…
## $ user_location <chr> "Sunshine State", "", "", "Paris, France", "London", …
## $ user_description <chr> "Culinarian | Hot Sauce Artisan | Kombucha Brewer | N…
## $ user_created <chr> "2011-03-23 03:50:13", "2022-01-30 17:41:41", "2022-0…
## $ user_followers <int> 4680, 770, 576, 236, 12731, 197, 5976, 26940, 792, 22…
## $ user_friends <int> 2643, 1145, 1, 1829, 46, 48, 1, 2050, 75, 3225, 4, 23…
## $ user_favourites <chr> "6232", "9166", "0", "2195", "134", "13", "107", "121…
## $ user_verified <chr> "False", "False", "False", "False", "False", "False",…
## $ date <chr> "2023-03-01 23:59:59", "2023-03-01 23:59:47", "2023-0…
## $ text <chr> "Which #bitcoin books should I think about reading ne…
## $ hashtags <chr> "['bitcoin']", "['Bitcoin']", "['Ethereum', 'ETH', 'B…
## $ source <chr> "Twitter for iPhone", "Twitter for iPhone", "Twitter …
## $ is_retweet <chr> "False", "False", "False", "False", "False", "False",…
Data Wrangling
Due to limitations in computational power, I will be limiting this exercise to 3000 tweets.
#clean up data & arrange in chronological order
btc_tweets<- btc_tweets %>%
mutate(
date = ymd_hms(date),
user_verified = as.factor(user_verified),
source=as.factor(source),
is_retweet = as.factor(is_retweet)) %>%
select(-user_created, -user_location, -user_description) %>%
arrange(desc(date)) %>%
head(3000)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `date = ymd_hms(date)`.
## Caused by warning:
## ! 59 failed to parse.
#remove data with no dates
btc_tweets<- btc_tweets[!is.na(btc_tweets$date), ]
#look at the first 6 rows
head(btc_tweets)
## user_name user_followers user_friends user_favourites
## 1 Live Price Crypto 4435 9 24
## 2 Cri.Credit CSC 63 564 246
## 3 PhotogTrader 134 249 1390
## 4 Crypto Alerted 1230 105 24
## 5 OceaniaBloom-William Oakley 1 18 1
## 6 ⬣ T - B i r d ⬣ 5288 1879 81957
## user_verified date
## 1 False 2023-03-05 23:59:56
## 2 False 2023-03-05 23:59:51
## 3 False 2023-03-05 23:59:51
## 4 False 2023-03-05 23:59:47
## 5 False 2023-03-05 23:59:43
## 6 False 2023-03-05 23:59:31
## text
## 1 #BTC price $22427\n\nDaily Data:\n•+0.36% 🟢\n•RSI: 44\n•Moving Average 20/50/100: 23584/22938/20018\n•Bollinger Bands: 21986/25183\n\nWeekly:\n•Since Monday: -4.79% 🔴\n•RSI: 53\n•MA 20/50/100: 19697/24068/35112\n•BB: 13908/25486\n\n#Crypto #Bitcoin #Live\n(1/13)\n464412
## 2 @creditweb3 @binance\n#BNB #BTC #ETH #Avax\n$CREDIT\n+2 M wallet. Verry fast.\nhttps://t.co/59Y3xa1Aqj https://t.co/TJTkZpEWWH
## 3 #BTC When I don't have a clue about price action, I'll just post pics.\n@CatherineGraceO #photographer #swimsuit #Swimwear #bathingsuit #ONEPIECE #FashionWeek #fashion https://t.co/Gg2SvNPGqJ
## 4 LGCY Network (LGCY) went up 15.5 percent in the last 1 minutes on https://t.co/rgEAV9gIXu. #LGCY #crypto #bitcoin #cryptocurrency #btc #ethereum #blockchain #eth #trading #altcoin #binance #cryptonews
## 5 EuroBliss's Omen With $15.\nCRYPTONOMICS II TRADONOMICS II INVESTONOMICS II TOKENOMICS II BLOOMONOMICS\n\n"If Opportunity doesn't Knock, build a Door."\n\n#oceaniabloom, #ethereumblockchain #eth #investing #onlinebusiness #btc #usdt\n\n#oceaniabloom, #investing…https://t.co/eJNszslg7W
## 6 $HEX is absolutely MURDERING #Bitcoin 🤯🤯🤯\nAnd it's only done 1 cycle, with full gatekeeping 😂🤣\n\nWhen #RichardHeart said he wanted to have the best performing asset in history, he wasn't kidding 😉 https://t.co/aGFEICYzks
## hashtags
## 1 ['BTC', 'Crypto', 'Bitcoin', 'Live']
## 2 ['BNB', 'BTC', 'ETH', 'Avax']
## 3 ['BTC', 'photographer', 'swimsuit', 'Swimwear', 'bathingsuit', 'ONEPIECE', 'FashionWeek', 'fashion']
## 4 ['LGCY', 'crypto', 'bitcoin', 'cryptocurrency', 'btc', 'ethereum', 'blockchain', 'eth', 'trading', 'altcoin', 'binance', 'cryptonews']
## 5 ['oceaniabloom', 'ethereumblockchain', 'eth', 'investing', 'onlinebusiness', 'btc', 'usdt', 'oceaniabloom', 'investing']
## 6 ['Bitcoin', 'RichardHeart']
## source is_retweet
## 1 vaiotapi False
## 2 Twitter for Android False
## 3 Twitter Web App False
## 4 Microsoft Power Platform False
## 5 LinkedIn False
## 6 Twitter Web App False
Data Cleaning
#turn content into corpus
btc_corpus <- btc_tweets$text %>%
VectorSource() %>%
VCorpus()
#remove numbers
btc_clean <- btc_corpus %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stemDocument) %>%
tm_map(stripWhitespace) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("english"))
btc_dtm<- DocumentTermMatrix(btc_clean)
inspect(btc_dtm)
## <<DocumentTermMatrix (documents: 3000, terms: 8459)>>
## Non-/sparse entries: 43003/25333997
## Sparsity : 100%
## Maximal term length: 89
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs altcoin bitcoin bnb btc crypto cryptocurr eth ethereum price updat
## 1188 1 1 1 2 1 0 1 0 0 0
## 129 0 0 0 0 0 0 0 0 0 0
## 146 0 0 0 0 0 0 0 0 0 0
## 2609 0 1 1 2 2 0 2 1 0 0
## 2722 0 1 2 1 0 0 1 0 0 0
## 2733 0 1 2 1 0 0 1 0 0 0
## 2735 0 1 2 1 0 0 1 0 0 0
## 2943 0 1 2 1 0 0 1 0 0 0
## 516 0 1 0 0 1 0 0 0 0 0
## 99 0 0 0 0 0 0 0 0 0 0
Remove infrequent words
btc_freq<- findFreqTerms(btc_dtm, lowfreq = 20)
length(btc_freq)
## [1] 357
btc_final <- btc_dtm[, btc_freq]
inspect(btc_final)
## <<DocumentTermMatrix (documents: 3000, terms: 357)>>
## Non-/sparse entries: 24974/1046026
## Sparsity : 98%
## Maximal term length: 22
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs altcoin bitcoin bnb btc crypto cryptocurr eth ethereum price updat
## 1334 0 1 0 0 0 0 0 0 0 0
## 1336 0 2 0 0 0 1 0 0 0 0
## 1343 0 1 0 0 0 0 0 0 0 0
## 1345 0 1 0 0 0 0 0 0 0 0
## 1363 0 1 0 0 0 0 0 0 0 0
## 1369 0 1 0 0 0 0 0 0 0 0
## 1371 0 1 0 0 0 0 0 0 0 0
## 1443 0 1 0 0 0 0 0 0 0 0
## 235 0 1 1 1 2 0 1 0 0 0
## 2609 0 1 1 2 2 0 2 1 0 0
** Use the sentimentr library to process the polarity of each tweet**
library(sentimentr)
##
## Attaching package: 'sentimentr'
## The following object is masked from 'package:plotly':
##
## highlight
analyze_sentiment <- function(text) {
sentiment <- sentiment(text)
return(sentiment)
}
btc_sentiments <- tm_map(btc_clean, content_transformer(analyze_sentiment))
df <- data.frame(Text = sapply(btc_sentiments, function(x) as.character(x)))
df<- t(df)
df<- as.data.frame(df)
df$Sentiment <- ifelse(df$V4 < 0, "negative", "positive")
#Check sentiment proportion
prop.table(table(df$Sentiment))
##
## negative positive
## 0.1933333 0.8066667
The class balance is not great, and this tips us into knowing that we will probably see a very high recall number and very low accompanying specificity
Cross Validation
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(btc_final), nrow(btc_final)*0.75)
btc_train_x <- btc_final[index,]
btc_test_x <- btc_final[-index,]
nrow(btc_train_x)
## [1] 2250
nrow(btc_test_x)
## [1] 750
btc_train_y <- df$Sentiment[index]
btc_test_y <- df$Sentiment[-index]
btc_train_ym<- as.matrix(btc_train_y)
btc_test_ym<- as.matrix(btc_test_y)
nrow(btc_train_ym)
## [1] 2250
nrow(btc_test_ym)
## [1] 750
** Apply the Bernoulli Converter **
bernoulli_conv <- function(x){
# parameter ifelse: kondisi, Hasil jika Kondisi TRUE, Hasil jika Kondisi FALSE
x <- as.factor(ifelse(x > 0, 1, 0))
return(x)
}
btc_train_bn <- apply(X = btc_train_x, MARGIN = 2, FUN = bernoulli_conv)
btc_test_bn <- apply(X = btc_test_x, MARGIN = 2, FUN = bernoulli_conv)
nrow(btc_train_bn)
## [1] 2250
nrow(btc_train_ym)
## [1] 2250
library(e1071)
naive_btc <- naiveBayes(x = btc_train_bn,
y = btc_train_ym,
)
btc_pred_class <- predict(object = naive_btc,
newdata = btc_test_bn)
btc_test_ydf<- as.data.frame(btc_test_ym)
btc_test_ydf$pred <- btc_pred_class
head(btc_test_ydf)
## V1 pred
## 1 positive negative
## 2 positive positive
## 3 negative negative
## 4 positive positive
## 5 positive positive
## 6 positive negative
The Naive Bayes model has accuracy of 65% in terms of predicting sentiment labeling.
As predicted, the recall is decent at 72% but the specificity i.e. the ability to correctly identify negative class, is even lower than chance. For labeling purposes, you want to minimize False Positive i.e. tweets that are actually bad sentiment being labelled as positive. As such, we should be looking at post pred value / precision, which is decent at almost 80%.
confusionMatrix(as.factor(btc_test_ydf$pred), as.factor(btc_test_ydf$V1), positive = "positive")
## Confusion Matrix and Statistics
##
## Reference
## Prediction negative positive
## negative 87 130
## positive 52 481
##
## Accuracy : 0.7573
## 95% CI : (0.725, 0.7876)
## No Information Rate : 0.8147
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3395
##
## Mcnemar's Test P-Value : 1.146e-08
##
## Sensitivity : 0.7872
## Specificity : 0.6259
## Pos Pred Value : 0.9024
## Neg Pred Value : 0.4009
## Prevalence : 0.8147
## Detection Rate : 0.6413
## Detection Prevalence : 0.7107
## Balanced Accuracy : 0.7066
##
## 'Positive' Class : positive
##