Random Forest & Text Mining- Classification in Machine Learning Part 2

Random Forest

Part 1: Introduction

This is a continuation from previous attempt to classify Bitcoin using logistic regression and KNN. The purpose is still the same i.e. identifying the factors that best predict the likelihood of Bitcoin price closing at a price higher vs the previous week, and this time, applying Random Forest classifier.

Few improvements will be included here: - It has been established through correlation matrix presented in the previous LBB that only VIX, broad market indices, gold price, and lagged trading volume has high correlation to Bitcoin price movements. Here, we will drop the other explanatory variables with relatively lower correlation to Bitcoin price, and this include business cycle variables such as treasury rates, Dollar Index, and halving hype / effects - Technical terms such as Bitcoin’s own EMA50 and MACD will be included - Timeframe will be shortened but frequency is daily instead of weekly

New variables

According to existing literature, gold and oil price proxies are both important explanatory variables for Bitcoin’s price behavior and they work very well with random forest classifier (up to 80% accuracy for short-term predictions). The reasoning is that Bitcoin has been seen by many investors and traders as digital substitute to gold which has held the safe haven role for many decades, and as such has been used to diversify against oil price’s volatility. For volatility proxy, previously VIX was used in the logit model. Now OVX, which is expected 30 days volatility for US crude oil will be used.

Technical indicators that will be used are EMA50 and MACD

library(yahoofinancer)
library(quantmod)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: TTR

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(dplyr)

## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:xts':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(TTR) 
library(scales) 
library(plotly)

## Loading required package: ggplot2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(corrplot)

## corrplot 0.92 loaded

library(readr)

## 
## Attaching package: 'readr'

## The following object is masked from 'package:scales':
## 
##     col_factor

library(class)
library(ggcorrplot)
library(caret)

## Loading required package: lattice

library(inspectdf)
library(partykit)

## Loading required package: grid

## Loading required package: libcoin

## Loading required package: mvtnorm

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

#Set up the start and end date
start_date <- as.Date("2019-11-01")
end_date <- as.Date("2023-10-31")

#load the data from API
symbols <- c("BTC-USD", "^OVX", "CL=F", "GC=F")
data_frames_list <- list()
for (symbol in symbols) {
  data <- as.data.frame(getSymbols(symbol, src = "yahoo", from = start_date, to = end_date, auto.assign = FALSE))
  data_frames_list[[symbol]] <- data
}

#Rename list elements with the synbols
names(data_frames_list) <- symbols

#Load each symbol into separate dataframe
btc<- data_frames_list$`BTC-USD` %>% 
  select(`BTC-USD.Close`)

ovx<- data_frames_list$`^OVX`  %>% 
  select(OVX.Close)

oil<- data_frames_list$`CL=F`  %>% 
  select(`CL=F.Close`)

gold <- data_frames_list$`GC=F` %>% 
  select(`GC=F.Close`)

#change the index to Date
btc$Date <- rownames(btc)
rownames(btc) <- NULL
btc<- btc %>% 
  select(Date, everything())

ovx$Date <- rownames(ovx)
rownames(ovx) <- NULL
ovx<- ovx %>% 
  select(Date, everything())

oil$Date <- rownames(oil)
rownames(oil) <- NULL
oil<- oil %>% 
  select(Date, everything())

gold$Date <- rownames(gold)
rownames(gold) <- NULL
gold<- gold %>% 
  select(Date, everything())

#combine all to one df
data_frames_list <- list(btc$`BTC-USD.Close`, ovx$OVX.Close, oil$`CL=F.Close`, gold$`GC=F.Close`)
bitcoin_date <- btc$Date
final_df <- data.frame(Date = bitcoin_date, do.call(cbind, data_frames_list))

## Warning in (function (..., deparse.level = 1) : number of rows of result is not
## a multiple of vector length (arg 2)

new_column_names <- c("Date", "BTC", "OVX", "OIL", "GOLD")
colnames(final_df) <- new_column_names

#check if any empty cells
any(is.na(final_df))

## [1] FALSE

#Glimpse
glimpse(final_df)

## Rows: 1,461
## Columns: 5
## $ Date <chr> "2019-11-01", "2019-11-02", "2019-11-03", "2019-11-04", "2019-11-…
## $ BTC  <dbl> 9261.104, 9324.718, 9235.354, 9412.612, 9342.527, 9360.880, 9267.…
## $ OVX  <dbl> 34.37, 35.63, 33.77, 33.22, 31.45, 31.45, 34.22, 33.74, 35.15, 33…
## $ OIL  <dbl> 56.20, 56.54, 57.23, 56.35, 57.15, 57.24, 56.86, 56.80, 57.12, 56…
## $ GOLD <dbl> 1508.0, 1508.0, 1480.8, 1490.2, 1464.2, 1461.3, 1455.5, 1452.1, 1…

Create Bitcoin technical indicators

Specifically the two that will be added as explanatory variables: - EMA50: Choice of 50 is due to empahsis on shorter timeframe - MACD: Standard MACD assumption of fast length = 12 days, slow length = 26 days, signal length = 9 days. Note for inclusion as explanatory variable we will use MACD histogram which is simply the difference between MACD and MACD Signal. In general positive value means bullish momentum, and negative values means otherwise

#Calculate EMA50
final_df$EMA50 <- EMA(final_df$BTC, n=50)

#Calculate MACD histogram
MACD <- MACD(final_df$BTC, nFast = 12, nSlow = 26, nSig = 9, maType = "EMA")
macd_histogram <- MACD[, "macd"] - MACD[, "signal"]
final_df$MACDhistogram <- macd_histogram

Data Wrangling

Next, calculate BTC price returns and remove all rows with empty values

# Define a function to calculate returns
calculate_returns <- function(column) {
  column <- as.numeric(column)  
  return((column / lag(column, default = first(column))) - 1)
}

# Apply the function to BTC column
final_df$BTC_ret<- calculate_returns(final_df$BTC)

#drop all rows with empty values
final_df <- na.omit(final_df)

#turn BTC into target variable class
final_df$BTC_ret <- ifelse(final_df$BTC_ret <=0, 0, 1)

#check target class proportion
prop.table(table(final_df$BTC_ret))

## 
##         0         1 
## 0.4936261 0.5063739

#Remove BTC price data as we are going to use BTC_ret (levels 0 and 1) as target variable
final_df <- final_df %>% 
  select(-BTC)

#Ensure data types are appropriate
final_df$Date <- ymd(final_df$Date)
final_df <- final_df %>% 
  mutate(BTC_ret = as.factor(BTC_ret))
rownames(final_df) <- final_df$Date
final_df$Date <- NULL

Part 2: Build Model

Introduction to Variables in the dataset

Bitcoin: Target variable. 1 denotes closing weekly positive while 0 denotes closing weekly negative
OVX : 30 days expectations of oil
Oil: Crude oil price index
Gold: Gold price index
EMA50: 50 days exponential smoothing of Bitcoin’s price
MACDhistogram: The difference between MACD and MACD line with fast length, slow length and EMA of 12, 26 and 9 days

Cross Validation

80% of the datasets will be used for training while the remaining 20% will be used for testing. After splitting, we would also be running a check to see if target variable class proportion remains somewhat balanced.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100) 
index <- sample(x = nrow(final_df), size = nrow(final_df)*0.8) 

#Splitting
model_train <- final_df[index,]
model_test <- final_df[-index,]

#Check prop table for target variable in the train dataset
prop.table(table(model_train$BTC_ret))

## 
##         0         1 
## 0.5031001 0.4968999

RF model

model_rf <- randomForest(BTC_ret~., data= model_train, ntree=1100, verbose=T)
model_test$Pred_rf <- predict(model_rf, model_test)

Part 3: Model Evaluation

As we can see from the result below, the result is not good as it’s even lower than chance.

confusionMatrix(data=as.factor(model_test$Pred_rf), reference=model_test$BTC_ret, positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 57 90
##          1 72 64
##                                           
##                Accuracy : 0.4276          
##                  95% CI : (0.3692, 0.4875)
##     No Information Rate : 0.5442          
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : -0.141          
##                                           
##  Mcnemar's Test P-Value : 0.1817          
##                                           
##             Sensitivity : 0.4156          
##             Specificity : 0.4419          
##          Pos Pred Value : 0.4706          
##          Neg Pred Value : 0.3878          
##              Prevalence : 0.5442          
##          Detection Rate : 0.2261          
##    Detection Prevalence : 0.4806          
##       Balanced Accuracy : 0.4287          
##                                           
##        'Positive' Class : 1               
##

varImpPlot(model_rf)

Text mining - Sentiment Analysis

Part 1: Introduction

This exercise will be to create a sentiment labeling process for a series of tweets related to Bitcoin and other cryptocurrencies.

This would ideally be very useful if we can further connect the quantities of positive sentiment per day with changes in Bitcoin price to uncover any kind of relationship, however this would not be possible due to data availability and computational power limitations. As a picture, there could be 100K tweets related to Bitcoin in a single day alone.

library(dplyr)
library(stringr)
library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(SnowballC) 

btc_tweets<- read.csv("archive/Bitcoin_tweets_dataset_2.csv")
glimpse(btc_tweets)

## Rows: 174,397
## Columns: 13
## $ user_name        <chr> "ChefSam", "Roy⚡️", "Ethereum Yoda", "Viction", "Rosi…
## $ user_location    <chr> "Sunshine State", "", "", "Paris, France", "London", …
## $ user_description <chr> "Culinarian | Hot Sauce Artisan | Kombucha Brewer | N…
## $ user_created     <chr> "2011-03-23 03:50:13", "2022-01-30 17:41:41", "2022-0…
## $ user_followers   <int> 4680, 770, 576, 236, 12731, 197, 5976, 26940, 792, 22…
## $ user_friends     <int> 2643, 1145, 1, 1829, 46, 48, 1, 2050, 75, 3225, 4, 23…
## $ user_favourites  <chr> "6232", "9166", "0", "2195", "134", "13", "107", "121…
## $ user_verified    <chr> "False", "False", "False", "False", "False", "False",…
## $ date             <chr> "2023-03-01 23:59:59", "2023-03-01 23:59:47", "2023-0…
## $ text             <chr> "Which #bitcoin books should I think about reading ne…
## $ hashtags         <chr> "['bitcoin']", "['Bitcoin']", "['Ethereum', 'ETH', 'B…
## $ source           <chr> "Twitter for iPhone", "Twitter for iPhone", "Twitter …
## $ is_retweet       <chr> "False", "False", "False", "False", "False", "False",…

Data Wrangling

Due to limitations in computational power, I will be limiting this exercise to 3000 tweets.

#clean up data & arrange in chronological order
btc_tweets<- btc_tweets %>% 
  mutate(
         date = ymd_hms(date),
         user_verified = as.factor(user_verified),
         source=as.factor(source),
         is_retweet = as.factor(is_retweet)) %>%
  select(-user_created, -user_location, -user_description) %>% 
  arrange(desc(date)) %>% 
  head(3000)

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `date = ymd_hms(date)`.
## Caused by warning:
## !  59 failed to parse.

#remove data with no dates
btc_tweets<- btc_tweets[!is.na(btc_tweets$date), ]

#look at the first 6 rows
head(btc_tweets)

##                     user_name user_followers user_friends user_favourites
## 1           Live Price Crypto           4435            9              24
## 2              Cri.Credit CSC             63          564             246
## 3                PhotogTrader            134          249            1390
## 4              Crypto Alerted           1230          105              24
## 5 OceaniaBloom-William Oakley              1           18               1
## 6             ⬣ T - B i r d ⬣           5288         1879           81957
##   user_verified                date
## 1         False 2023-03-05 23:59:56
## 2         False 2023-03-05 23:59:51
## 3         False 2023-03-05 23:59:51
## 4         False 2023-03-05 23:59:47
## 5         False 2023-03-05 23:59:43
## 6         False 2023-03-05 23:59:31
##                                                                                                                                                                                                                                                                                            text
## 1              #BTC price $22427\n\nDaily Data:\n•+0.36% 🟢\n•RSI: 44\n•Moving Average 20/50/100: 23584/22938/20018\n•Bollinger Bands: 21986/25183\n\nWeekly:\n•Since Monday: -4.79% 🔴\n•RSI: 53\n•MA 20/50/100: 19697/24068/35112\n•BB: 13908/25486\n\n#Crypto #Bitcoin #Live\n(1/13)\n464412
## 2                                                                                                                                                                @creditweb3 @binance\n#BNB #BTC #ETH #Avax\n$CREDIT\n+2 M wallet. Verry fast.\nhttps://t.co/59Y3xa1Aqj https://t.co/TJTkZpEWWH
## 3                                                                                               #BTC When I don't have a clue about price action, I'll just post pics.\n@CatherineGraceO #photographer #swimsuit #Swimwear #bathingsuit #ONEPIECE #FashionWeek #fashion https://t.co/Gg2SvNPGqJ
## 4                                                                                     LGCY Network (LGCY) went up 15.5 percent in the last 1 minutes on https://t.co/rgEAV9gIXu. #LGCY  #crypto #bitcoin #cryptocurrency #btc #ethereum #blockchain #eth #trading #altcoin #binance #cryptonews
## 5 EuroBliss's Omen With $15.\nCRYPTONOMICS II TRADONOMICS II INVESTONOMICS II TOKENOMICS II BLOOMONOMICS\n\n"If Opportunity doesn't Knock, build a Door."\n\n#oceaniabloom, #ethereumblockchain #eth #investing #onlinebusiness #btc #usdt\n\n#oceaniabloom, #investing…https://t.co/eJNszslg7W
## 6                                                          $HEX is absolutely MURDERING #Bitcoin 🤯🤯🤯\nAnd it's only done 1 cycle, with full gatekeeping 😂🤣\n\nWhen #RichardHeart said he wanted to have the best performing asset in history, he wasn't kidding 😉 https://t.co/aGFEICYzks
##                                                                                                                                 hashtags
## 1                                                                                                   ['BTC', 'Crypto', 'Bitcoin', 'Live']
## 2                                                                                                          ['BNB', 'BTC', 'ETH', 'Avax']
## 3                                   ['BTC', 'photographer', 'swimsuit', 'Swimwear', 'bathingsuit', 'ONEPIECE', 'FashionWeek', 'fashion']
## 4 ['LGCY', 'crypto', 'bitcoin', 'cryptocurrency', 'btc', 'ethereum', 'blockchain', 'eth', 'trading', 'altcoin', 'binance', 'cryptonews']
## 5               ['oceaniabloom', 'ethereumblockchain', 'eth', 'investing', 'onlinebusiness', 'btc', 'usdt', 'oceaniabloom', 'investing']
## 6                                                                                                            ['Bitcoin', 'RichardHeart']
##                     source is_retweet
## 1                 vaiotapi      False
## 2      Twitter for Android      False
## 3          Twitter Web App      False
## 4 Microsoft Power Platform      False
## 5                 LinkedIn      False
## 6          Twitter Web App      False

Data Cleaning

#turn content into corpus

btc_corpus <- btc_tweets$text %>% 
  VectorSource() %>% 
  VCorpus()

#remove numbers
btc_clean <- btc_corpus %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stemDocument) %>%
  tm_map(stripWhitespace) %>%
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english"))

btc_dtm<- DocumentTermMatrix(btc_clean)
inspect(btc_dtm)

## <<DocumentTermMatrix (documents: 3000, terms: 8459)>>
## Non-/sparse entries: 43003/25333997
## Sparsity           : 100%
## Maximal term length: 89
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   altcoin bitcoin bnb btc crypto cryptocurr eth ethereum price updat
##   1188       1       1   1   2      1          0   1        0     0     0
##   129        0       0   0   0      0          0   0        0     0     0
##   146        0       0   0   0      0          0   0        0     0     0
##   2609       0       1   1   2      2          0   2        1     0     0
##   2722       0       1   2   1      0          0   1        0     0     0
##   2733       0       1   2   1      0          0   1        0     0     0
##   2735       0       1   2   1      0          0   1        0     0     0
##   2943       0       1   2   1      0          0   1        0     0     0
##   516        0       1   0   0      1          0   0        0     0     0
##   99         0       0   0   0      0          0   0        0     0     0

Remove infrequent words

btc_freq<- findFreqTerms(btc_dtm, lowfreq = 20)
length(btc_freq)

## [1] 357

btc_final <- btc_dtm[, btc_freq] 
inspect(btc_final)

## <<DocumentTermMatrix (documents: 3000, terms: 357)>>
## Non-/sparse entries: 24974/1046026
## Sparsity           : 98%
## Maximal term length: 22
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   altcoin bitcoin bnb btc crypto cryptocurr eth ethereum price updat
##   1334       0       1   0   0      0          0   0        0     0     0
##   1336       0       2   0   0      0          1   0        0     0     0
##   1343       0       1   0   0      0          0   0        0     0     0
##   1345       0       1   0   0      0          0   0        0     0     0
##   1363       0       1   0   0      0          0   0        0     0     0
##   1369       0       1   0   0      0          0   0        0     0     0
##   1371       0       1   0   0      0          0   0        0     0     0
##   1443       0       1   0   0      0          0   0        0     0     0
##   235        0       1   1   1      2          0   1        0     0     0
##   2609       0       1   1   2      2          0   2        1     0     0

** Use the sentimentr library to process the polarity of each tweet**

library(sentimentr)

## 
## Attaching package: 'sentimentr'

## The following object is masked from 'package:plotly':
## 
##     highlight

analyze_sentiment <- function(text) {
  sentiment <- sentiment(text)
  return(sentiment)
}

btc_sentiments <- tm_map(btc_clean, content_transformer(analyze_sentiment))
df <- data.frame(Text = sapply(btc_sentiments, function(x) as.character(x)))
df<- t(df)
df<- as.data.frame(df)


df$Sentiment <- ifelse(df$V4 < 0, "negative", "positive")

#Check sentiment proportion
prop.table(table(df$Sentiment))

## 
##  negative  positive 
## 0.1933333 0.8066667

The class balance is not great, and this tips us into knowing that we will probably see a very high recall number and very low accompanying specificity

Cross Validation

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

index <- sample(nrow(btc_final), nrow(btc_final)*0.75)
btc_train_x <- btc_final[index,] 
btc_test_x <- btc_final[-index,]


nrow(btc_train_x)

## [1] 2250

nrow(btc_test_x)

## [1] 750

btc_train_y <- df$Sentiment[index]
btc_test_y <- df$Sentiment[-index]

btc_train_ym<- as.matrix(btc_train_y)
btc_test_ym<- as.matrix(btc_test_y)

nrow(btc_train_ym)

## [1] 2250

nrow(btc_test_ym)

## [1] 750

** Apply the Bernoulli Converter **

bernoulli_conv <- function(x){
  # parameter ifelse: kondisi, Hasil jika Kondisi TRUE, Hasil jika Kondisi FALSE
  x <- as.factor(ifelse(x > 0, 1, 0)) 
  return(x)
}

btc_train_bn <- apply(X = btc_train_x, MARGIN = 2, FUN = bernoulli_conv)
btc_test_bn <- apply(X = btc_test_x, MARGIN = 2, FUN = bernoulli_conv)

nrow(btc_train_bn)

## [1] 2250

nrow(btc_train_ym)

## [1] 2250

Part 2: Build Model

library(e1071)
naive_btc <- naiveBayes(x = btc_train_bn,
                         y = btc_train_ym,
                         )

btc_pred_class <- predict(object = naive_btc,
                          newdata = btc_test_bn)
                          

btc_test_ydf<- as.data.frame(btc_test_ym)
btc_test_ydf$pred <- btc_pred_class

head(btc_test_ydf)

##         V1     pred
## 1 positive negative
## 2 positive positive
## 3 negative negative
## 4 positive positive
## 5 positive positive
## 6 positive negative

Part 3: Model Evaluation

The Naive Bayes model has accuracy of 65% in terms of predicting sentiment labeling.

As predicted, the recall is decent at 72% but the specificity i.e. the ability to correctly identify negative class, is even lower than chance. For labeling purposes, you want to minimize False Positive i.e. tweets that are actually bad sentiment being labelled as positive. As such, we should be looking at post pred value / precision, which is decent at almost 80%.

confusionMatrix(as.factor(btc_test_ydf$pred), as.factor(btc_test_ydf$V1), positive = "positive")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative       87      130
##   positive       52      481
##                                          
##                Accuracy : 0.7573         
##                  95% CI : (0.725, 0.7876)
##     No Information Rate : 0.8147         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3395         
##                                          
##  Mcnemar's Test P-Value : 1.146e-08      
##                                          
##             Sensitivity : 0.7872         
##             Specificity : 0.6259         
##          Pos Pred Value : 0.9024         
##          Neg Pred Value : 0.4009         
##              Prevalence : 0.8147         
##          Detection Rate : 0.6413         
##    Detection Prevalence : 0.7107         
##       Balanced Accuracy : 0.7066         
##                                          
##        'Positive' Class : positive       
##

Random Forest & Text Mining- Classification in Machine Learning Part 2

Olivia Sim

2023-11-01

Random Forest

Part 1: Introduction

Part 2: Build Model

Part 3: Model Evaluation

Text mining - Sentiment Analysis

Part 1: Introduction

Part 2: Build Model

Part 3: Model Evaluation