Likelihood of Bitcoin Closing Weekly Positive - Classification in Machine Learning Part 1

Data sourcing from Yahoo Finance API

We will explore which factors best predict the likelihood of Bitcoin price closing at a price higher vs the previous week.

Firstly, we will load all the libraries we need for this exercise and start to collect our price data from Yahoo Finance API. The choice of explanatory variables are based on numerous literatures on Bitcoin, such as Bitcoin’s trading volume, as well as macro factors such as US broad indices, oil prices, gold prices, US treasury yields, inflation, dollar index, volatility and others are seen as important contributors to Bitcoin’s price movement. On top of this I will also include a categorical explanatory variable related to Bitcoin halving which many experts believe have strong impact on Bitcoin’s price performance.

library(yahoofinancer)
library(quantmod)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: TTR

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(dplyr)

## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:xts':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2) 
library(scales) 
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(corrplot)

## corrplot 0.92 loaded

library(readr)

## 
## Attaching package: 'readr'

## The following object is masked from 'package:scales':
## 
##     col_factor

library(class)
library(ggcorrplot)
library(caret)

## Loading required package: lattice

#Set up the start and end date
start_date <- as.Date("2014-09-15")
end_date <- as.Date("2023-10-31")

#load the data from API
symbols <- c("BTC-USD", "^TNX", "^VIX", "DX-Y.NYB", "^IXIC", "^DJI", "^GSPC", "GC=F")
data_frames_list <- list()
for (symbol in symbols) {
  data <- as.data.frame(getSymbols(symbol, src = "yahoo", from = start_date, to = end_date, auto.assign = FALSE, periodicity = "weekly"))
  data_frames_list[[symbol]] <- data
}

#Rename list elements with the synbols
names(data_frames_list) <- symbols

#Load each symbol into separate dataframe
btc<- data_frames_list$`BTC-USD` %>% 
  select(`BTC-USD.Close`)

treasury<- data_frames_list$`^TNX`  %>% 
  select(TNX.Close)

vix<- data_frames_list$`^VIX`  %>% 
  select(VIX.Close)

dxy <- data_frames_list$`DX-Y.NYB` %>% 
  select(`DX-Y.NYB.Close`)

nasdaq<- data_frames_list$`^IXIC` %>% 
  select(IXIC.Close)

dow<- data_frames_list$`^DJI` %>% 
  select(DJI.Close)

sp500<- data_frames_list$`^GSPC` %>% 
  select(GSPC.Close)

gold <- data_frames_list$`GC=F` %>% 
  select(`GC=F.Close`)

#Get BTC vol data

btc_vol <- as.data.frame(getSymbols("BTC-USD", src = "yahoo", from = start_date, to = end_date, auto.assign = FALSE, periodicity="weekly")) %>% 
  select(`BTC-USD.Volume`)

Create Bitcoin halving dataframe

Next, we will need to create the Bitcoin halving data on weekly basis. Due to the data availability of certain predictors, we are unable to go back 10 years, so we will start at 15 September 2014. These are the dates for Bitcoin halving between 1 January 2015 and current date:

9 July 2016
11 May 2020

Since many believe that halving–both anticipation and post–are usually the strongest predictor of Bitcoin price, the following will be applied:

Value of 1 * 3 months before actual halving date * 6 months post actual halving date

Value of 0 * Any time periods outside of the one specified above

#Set up the halving dates and date ranges
halving_dates <- as.Date(c("2016-07-09", "2020-05-11"))
date_sequence <- seq(start_date, to = end_date, by = "weeks")

#Initializes the value column with 0 first, before we change some to 1 based on the interval 
values <- rep(0, length(date_sequence))

# Define the number of weeks before and after each halving date
weeks_before <- 3*4  
weeks_after <- 6*4

# Calculate the date ranges for each halving
halving_ranges <- lapply(halving_dates, function(halving_date) {
  before_start <- halving_date - weeks(weeks_before)
  after_end <- halving_date + weeks(weeks_after)
  interval(before_start, after_end)
})

# Set the value to 1 for dates within the halving range, and 0 otherwise:
for (halving_range in halving_ranges) {
  values <- ifelse(date_sequence %within% halving_range, 1, 0)
}

# Create a dataframe with Date and Value columns
df_halving <- data.frame(Date = date_sequence, Value = values)

#Calculate the proportion of halving days vs non halving days
table(df_halving$Value)

## 
##   0   1 
## 440  37

prop.table(table(df_halving$Value))

## 
##          0          1 
## 0.92243187 0.07756813

#Make sure Value is a factor
df_halving = df_halving %>% 
  mutate(Value=as.factor(Value))

Based on the above, ~92% of our categorical variable dataset are the dates that technically have no influence from Bitcoin halving, while ~8% are within the halving date range.

Data Wrangling

Next, we move on to ensure the following:

Convert the index into a Date object column for each of the imported datasets.
Equalize dates, ensuring that the data has the same dimension, and merge them all into one dataframe.
Remove any empty data if it exists.
Include a lagged term for BTC volume.
Apply a function to calculate returns using the formula: data[i] / data[i-1] - 1.

#Bitcoin
btc$Date <- rownames(btc)
rownames(btc) <- NULL
btc<- btc %>% 
  select(Date, everything())

#Treasury
treasury$Date <- rownames(treasury)
rownames(treasury) <- NULL
treasury<- treasury %>% 
  select(Date, everything())

#VIX
vix$Date <- rownames(vix)
rownames(vix) <- NULL
vix<- vix %>% 
  select(Date, everything())

#DXY
dxy$Date <- rownames(dxy)
rownames(dxy) <- NULL
dxy<- dxy %>% 
  select(Date, everything())

#NASDAQ
nasdaq$Date <- rownames(nasdaq)
rownames(nasdaq) <- NULL
nasdaq<- nasdaq %>% 
  select(Date, everything())

#DOW
dow$Date <- rownames(dow)
rownames(dow) <- NULL
dow<- dow %>% 
  select(Date, everything())

#SP500
sp500$Date <- rownames(sp500)
rownames(sp500) <- NULL
sp500<- sp500 %>% 
  select(Date, everything())

#GOLD
gold$Date <- rownames(gold)
rownames(gold) <- NULL
gold<- gold %>% 
  select(Date, everything())

#BTC vol
btc_vol$Date <- rownames(btc_vol)
rownames(btc_vol)<- NULL
btc_vol<- btc_vol %>% 
  select(Date, everything())

#Coerce all datasets to a sequence of common dates
common_dates <- seq(as.Date("2014-09-15"), as.Date("2023-10-31"), by = "weeks")
bitcoin_df <- merge(data.frame(Date = common_dates), btc, by = "Date", all.x = TRUE)
halving_df <- merge(data.frame(Date = common_dates), df_halving, by = "Date", all.x = TRUE)
treasury_df <- merge(data.frame(Date = common_dates), treasury, by = "Date", all.x = TRUE)
vix_df <- merge(data.frame(Date = common_dates), vix, by = "Date", all.x = TRUE)
dxy_df <- merge(data.frame(Date = common_dates), dxy, by = "Date", all.x = TRUE)
nasdaq_df<- merge(data.frame(Date = common_dates), nasdaq, by = "Date", all.x = TRUE)
dow_df<- merge(data.frame(Date = common_dates), dow, by = "Date", all.x = TRUE)
sp500_df <- merge(data.frame(Date = common_dates), sp500, by = "Date", all.x = TRUE)
gold_df <- merge(data.frame(Date = common_dates), gold, by = "Date", all.x = TRUE)
btc_vol_df <- merge(data.frame(Date = common_dates), btc_vol, by="Date", all.x=TRUE)

#Fill the missing values
bitcoin_df$`BTC-USD.Close` <- na.locf(bitcoin_df$`BTC-USD.Close`)
halving_df$Value <- na.locf(halving_df$Value)
treasury_df$TNX.Close <- na.locf(treasury$TNX.Close)
vix_df$VIX.Close <- na.locf(vix_df$VIX.Close)
dxy_df$`DX-Y.NYB.Close`<- na.locf(dxy_df$`DX-Y.NYB.Close`)
nasdaq_df$IXIC.Close<-na.locf(nasdaq_df$IXIC.Close)
dow_df$DJI.Close<- na.locf(dow_df$DJI.Close)
sp500_df$GSPC.Close <- na.locf(sp500_df$GSPC.Close)
gold_df$`GC=F.Close`<- na.locf(gold_df$`GC=F.Close`)
btc_vol_df$`BTC-USD.Volume` <- na.locf(btc_vol_df$`BTC-USD.Volume`)

#Final check to ensure no missing values
any(is.na(bitcoin_df))

## [1] FALSE

any(is.na(treasury_df))

## [1] FALSE

any(is.na(vix_df))

## [1] FALSE

any(is.na(dxy_df))

## [1] FALSE

any(is.na(nasdaq_df))

## [1] FALSE

any(is.na(dow_df))

## [1] FALSE

any(is.na(sp500_df))

## [1] FALSE

any(is.na(gold_df))

## [1] FALSE

any(is.na(df_halving))

## [1] FALSE

any(is.na(btc_vol_df))

## [1] FALSE

#Combine all the data into a single dataframe called final_df
data_frames_list <- list(bitcoin_df$`BTC-USD.Close`, df_halving$Value, treasury_df$TNX.Close, vix_df$VIX.Close, dxy_df$`DX-Y.NYB.Close`, nasdaq_df$IXIC.Close, dow_df$DJI.Close, sp500_df$GSPC.Close, gold_df$`GC=F.Close`, btc_vol_df$`BTC-USD.Volume`)
bitcoin_date <- bitcoin_df$Date
final_df <- data.frame(Date = bitcoin_date, do.call(cbind, data_frames_list))


#Rename the new columns for readability
new_column_names <- c("Date", "Bitcoin", "Halving", "Treasury", "VIX", "DXY", "Nasdaq", "Dow", "SP500", "Gold", "BTC_Vol")
colnames(final_df) <- new_column_names

#set Halving as factor (1 = No Bitcoin Halving Effects, 2 = Bitcoin Halving Effects)
final_df$Halving = as.factor(final_df$Halving)

#Include one lagged term for BTC_vol--as this is not a variable that is known ahead of time, but we would like to check if its lagged value has any predictive power
final_df<- final_df %>% 
  mutate(BTC_Vol_Lagged = lag(BTC_Vol, 1))

#final check
head(final_df)

##         Date Bitcoin Halving Treasury   VIX   DXY  Nasdaq      Dow   SP500
## 1 2014-09-15 398.821       1    2.587 12.11 84.80 4579.79 17279.74 2010.40
## 2 2014-09-22 377.181       1    2.535 14.85 85.64 4512.19 17113.15 1982.85
## 3 2014-09-29 320.510       1    2.447 14.55 86.69 4475.62 17009.69 1967.90
## 4 2014-10-06 378.549       1    2.307 21.24 85.91 4276.24 16544.10 1906.13
## 5 2014-10-13 389.546       1    2.199 21.99 85.19 4258.44 16380.41 1886.76
## 6 2014-10-20 354.704       1    2.273 16.11 85.73 4483.72 16805.41 1964.58
##     Gold   BTC_Vol BTC_Vol_Lagged
## 1 1215.3 156903400             NA
## 2 1214.1 186772600      156903400
## 3 1192.2 276657896      186772600
## 4 1221.0 341152804      276657896
## 5 1238.3 156902070      341152804
## 6 1231.2 113691800      156902070

Correlation Matrix

Next, we calculate correlation for each of the variables in the dataset. Consistent with past literatures, US broad indices look like they are able to explain Bitcoin price movements better than the other variables such as treasury rates, which importance may or may not have been overinflated by the media.

#Calculate correlation
cor_matrix <- cor(final_df[, sapply(final_df, is.numeric)], use = "pairwise.complete.obs")
corrplot(cor_matrix, method = "color")

Convert dataset from absolute to relative terms

Combine all the data into a single dataframe.
Convert the Bitcoin dataframe into a factor:
- Bitcoin takes a value of 1 if it closes weekly positive.
- Bitcoin takes a value of 0 if it closes weekly negative.
Convert explanatory variables that are numerical to relative values using the formula: value(t) / value(t-1) - 1.

#Extract the Date column to bind it back later
date_column <- final_df$Date
Halving <- final_df$Halving

numeric_columns<- final_df %>% 
  select_if(is.numeric)

#return function
calculate_returns <- function(column) {
  return((column / lag(column, default = first(column))) - 1)
}

#apply function to all numeric columns
numeric_columns_relative<- lapply(numeric_columns, calculate_returns)

#Bind the Date column back to the resulting dataframe
weekly_returns_df <- cbind(Date = date_column, Halving = Halving, as.data.frame(numeric_columns_relative))
weekly_returns_df$Halving <- ifelse(weekly_returns_df$Halving == "1", 0, 1)

#Remove the entire first row of the resulting dataframe 
weekly_returns_df <- weekly_returns_df[-c(1,2), ]

#Final check
head(weekly_returns_df)

##         Date Halving     Bitcoin     Treasury         VIX          DXY
## 3 2014-09-29       0 -0.15024879 -0.034714026 -0.02020203  0.012260662
## 4 2014-10-06       0  0.18108327 -0.057212956  0.45979378 -0.008997563
## 5 2014-10-13       0  0.02905034 -0.046814063  0.03531073 -0.008380878
## 6 2014-10-20       0 -0.08944253  0.033651716 -0.26739423  0.006338783
## 7 2014-10-27       0 -0.08122832  0.027276743 -0.12911240  0.013414136
## 8 2014-11-03       0  0.11467606 -0.009850108 -0.06486100  0.008747723
##          Nasdaq          Dow        SP500         Gold     BTC_Vol
## 3 -0.0081046731 -0.006045698 -0.007539628 -0.018038073  0.48125526
## 4 -0.0445479906 -0.027372037 -0.031388800  0.024157063  0.23312152
## 5 -0.0041626036 -0.009894129 -0.010161949  0.014168754 -0.54008272
## 6  0.0529020666  0.025945626  0.041245281 -0.005733746 -0.27539643
## 7  0.0327897399  0.034816727  0.027217061 -0.048814147 -0.05819329
## 8  0.0003864503  0.010546560  0.006872969 -0.001280847  0.09075607
##   BTC_Vol_Lagged
## 3     0.19036681
## 4     0.48125526
## 5     0.23312152
## 6    -0.54008272
## 7    -0.27539643
## 8    -0.05819329

#Change Bitcoin into factor with two levels i.e. 0 and 1
weekly_returns_df$Bitcoin <- ifelse(weekly_returns_df$Bitcoin>0, 1,0)
weekly_returns_df$Bitcoin <- as.factor(weekly_returns_df$Bitcoin)

#Check balance of our target variable
prop.table(table(weekly_returns_df$Bitcoin))

## 
##         0         1 
## 0.4463158 0.5536842

Note that as the proportion of the target variable is somewhat balanced, there is no need to apply over/undersampling procedure.At this stage, the data wrangling process is completed.

Part 2: Logistic Regression

Introduction to Variables in the dataset

Bitcoin: Target variable. 1 denotes closing weekly positive while 0 denotes closing weekly negative
Halving: 1 denotes weeks in which there is hype leading to Bitcoin halving event, and also weeks after the event
Treasury: The level of US government treasury bond yields
VIX : Chicago Board Options Exchange’s CBOE Volatility Index, a popular measure of the stock market’s expectation of volatility based on S&P 500 index options
Nasdaq: Nasdaq Composite Index, tracking the performance of technology and internet-based companies
Dow: Dow Jones Industrial Average, a stock market index representing 30 large, publicly-owned companies in the US
SP500: S&P 500 Index, measuring the performance of 500 of the largest companies in the US.
Gold Index: An index tracking the price of gold, a precious metal often used as a hedge against inflation.
BTC Volume Lagged: Bitcoin trading volume, the total amount of Bitcoin traded in a given period, indicating market activity.Note that we include 1 week lag in this dataset

Cross Validation

80% of the datasets will be used for training while the remaining 20% will be used for testing. After splitting, we would also be running a check to see if target variable class proportion remains somewhat balanced.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100) 
index <- sample(x = nrow(weekly_returns_df), size = nrow(weekly_returns_df)*0.8) 

#Splitting
weekly_returns_train <- weekly_returns_df[index,]
weekly_returns_test <- weekly_returns_df[-index,]

#Check prop table for target variable in the train dataset
prop.table(table(weekly_returns_train$Bitcoin))

## 
##         0         1 
## 0.4368421 0.5631579

No predictors

We will start with looking at the likelihood of Bitcoin closing weekly positive when there are no predictors involved. Bearing in mind that the dataset is from September 2016 to current date, and the train dataset contains (randomly) 80% of the entire dataset.

#The likelihood of Bitcoin going up in absence of any predictor variables
model_null<- glm(Bitcoin ~ 1, data = weekly_returns_train, family = "binomial")
summary(model_null)

## 
## Call:
## glm(formula = Bitcoin ~ 1, family = "binomial", data = weekly_returns_train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.2540     0.1034   2.456   0.0141 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 520.71  on 379  degrees of freedom
## Residual deviance: 520.71  on 379  degrees of freedom
## AIC: 522.71
## 
## Number of Fisher Scoring iterations: 3

#Likelihood of Bitcoin closing positive on weekly basis is approximately 1.3x, or 30% more likely than closing negative 
exp(model_null$coefficients)

## (Intercept) 
##    1.289157

#The probability of Bitcoin closing weekly positive is 78% given no changes in explanatory variables
1/(1+exp(-(exp(model_null$coefficients))))

## (Intercept) 
##   0.7840044

Selected predictors

We include all the numerical and categorical predictor variables to predict Bitcoin price movement.

model_all <- glm(Bitcoin~Treasury+VIX+DXY+Nasdaq+Dow+SP500+BTC_Vol_Lagged, weekly_returns_train, family="binomial")
summary(model_all)

## 
## Call:
## glm(formula = Bitcoin ~ Treasury + VIX + DXY + Nasdaq + Dow + 
##     SP500 + BTC_Vol_Lagged, family = "binomial", data = weekly_returns_train)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     0.23199    0.10990   2.111   0.0348 *
## Treasury        0.09682    1.77455   0.055   0.9565  
## VIX            -0.43605    0.98073  -0.445   0.6566  
## DXY            -6.32673   12.02063  -0.526   0.5987  
## Nasdaq          6.87330   15.88250   0.433   0.6652  
## Dow             3.70218   21.63181   0.171   0.8641  
## SP500          -3.68028   35.96630  -0.102   0.9185  
## BTC_Vol_Lagged  0.28096    0.35611   0.789   0.4301  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 520.71  on 379  degrees of freedom
## Residual deviance: 513.97  on 372  degrees of freedom
## AIC: 529.97
## 
## Number of Fisher Scoring iterations: 4

There are a few things to note from this result:

After all predictors are included, none of them are statistically significant except for the intercept
This could mean that there are other explanatory variables that are not included in the model. To solve this, we will need more references to more literature on Bitcoin price analysis
We should also check multicollinearity between the predictor variables

Check Multicollinearity

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(model_all)

##       Treasury            VIX            DXY         Nasdaq            Dow 
##       1.125943       2.452814       1.263733      16.722715      23.189644 
##          SP500 BTC_Vol_Lagged 
##      62.438391       1.005388

We find that perhaps not surprisingly that the US broad market indices (Nasdaq, Dow and SP500) are highly correlated to one another, so including more than one broad indices in the model does not add any informational value to our target variable. There appears to be no issues with the other predictor variables (all values less than 10).

Inspect AIC values and apply stepwise method to select predictors

Comparing the AIC values between model_null and model_all shows us that the model including all the predictors do not add any significant value to the model. Next, we will utilize stepwise method to help us choose the most suitable predictor variable.

The new model, model_step, has the lowest AIC level and is therefore a better model compared to model_null and model_all

#Compare AIC score between model_null and model_all
model_null$aic

## [1] 522.7125

model_all$aic

## [1] 529.9694

#Apply stepwise method
model_step <- step(object=model_all, direction = "backward", trace=F)

#Compare AIC score between model_null, model_all and model_step
model_step$aic

## [1] 519.1228

Evidently, the best AIC score is obtained by model_step, which will be our final model for logistic regression.

What are the most significant predictors of Bitcoin’s price movement according to the stepwise model?

summary(model_step)

## 
## Call:
## glm(formula = Bitcoin ~ Nasdaq, family = "binomial", data = weekly_returns_train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.2366     0.1044   2.267   0.0234 *
## Nasdaq        8.9826     3.8593   2.327   0.0199 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 520.71  on 379  degrees of freedom
## Residual deviance: 515.12  on 378  degrees of freedom
## AIC: 519.12
## 
## Number of Fisher Scoring iterations: 4

Model insights

Turns out, Nasdaq is the most significant predictor variable for Bitcoin’s weekly prices according to the stepwise method.

Several observations regarding model_step:

Assuming an alpha of 95%, both the intercept and predictor are statistically significant.
For all the other independent variables that are not included in model_step, we cannot rule out the fact that they may be statistically significant over the short term, bearing in mind that this logistic regression spans over 9 years.
Over the long run, only Nasdaq seems to have any significant predictive power.
There are also other factors which have not been explored yet but should be added as explanatory variables.

Interpretation of Coefficients

Interpretation of the coefficients of model_step:

If there are no changes to the value of Nasdaq, the likelihood of Bitcoin closing weekly positive is 1.2x compared to closing weekly negative. This translates to a probability of approximately 78%.
Given there is a 1% increase in Nasdaq, the likelihood of Bitcoin closing positive is significantly higher than closing negative, such that the probability of Bitcoin closing positive is close to 100%.

#Coefficient interpretations
exp(model_step$coefficients[1])

## (Intercept) 
##    1.266977

exp(model_step$coefficients[2])

##   Nasdaq 
## 7963.025

#Coefficient interpretations - Probability
1/(1+exp(-(exp(model_step$coefficients[1]))))

## (Intercept) 
##   0.7802249

1/(1+exp(-(exp(model_step$coefficients[2]))))

## Nasdaq 
##      1

Model Prediction

We will utilize the best fitting model, which is model_step to make the prediction for evaluation.

weekly_returns_test$Pred<- predict(object=model_step, newdata = weekly_returns_test, type="response")
weekly_returns_test$Pred_label<- ifelse(weekly_returns_test$Pred > 0.5, yes="1", no="0")
head(weekly_returns_test)

##          Date Halving Bitcoin     Treasury         VIX          DXY
## 3  2014-09-29       0       0 -0.034714026 -0.02020203  0.012260662
## 4  2014-10-06       0       1 -0.057212956  0.45979378 -0.008997563
## 10 2014-11-17       0       0 -0.002155119 -0.03080397  0.008911217
## 13 2014-12-08       0       0 -0.088426529  0.77291836 -0.010858628
## 19 2015-01-19       0       1  0.001101914 -0.20477331  0.026805052
## 21 2015-02-02       0       0  0.157014937 -0.17548872 -0.001054917
##          Nasdaq          Dow        SP500        Gold    BTC_Vol BTC_Vol_Lagged
## 3  -0.008104673 -0.006045698 -0.007539628 -0.01803807  0.4812553      0.1903668
## 4  -0.044547991 -0.027372037 -0.031388800  0.02415706  0.2331215      0.4812553
## 10  0.005210615  0.009941758  0.011608894  0.01054852 -0.1443458      0.7829721
## 13 -0.026598213 -0.037750818 -0.035193800  0.02680449  0.7570012     -0.3852450
## 19  0.026648657  0.009195594  0.016044222  0.01229536 -0.4790440      1.2108433
## 21  0.023549948  0.038411989  0.030315932 -0.03488461 -0.4013463      0.6220743
##         Pred Pred_label
## 3  0.5408670          1
## 4  0.4592106          0
## 10 0.5703891          1
## 13 0.4994285          0
## 19 0.6168046          1
## 21 0.6102048          1

Model Evaluation

The accuracy of the model_step using Nasdaq as the only predictor variable to Bitcoin’s movement is 60%, which is better than chance but not good as well. Recall / sensitivity is very high at >90% which means the model is able to maximize capturing the positive class. Recall that the target variable proportion was quite balanced when we build the model.

From business perspective, people who care about whether or not Bitcoin closes higher or lower and use this model are most likely speculative traders who use leverage to trade Bitcoin. Speculative leverage traders do not care whether Bitcoin price will go up or down as they can make money either way as long as they correctly position themselves. As such, there is no specific need to minimize either False Positive and False Negative. From performance perspective what we will need to maximize is accuracy i.e. the ability to correctly identify both positive and negative class.

confusionMatrix(data=as.factor(weekly_returns_test$Pred_label), reference=weekly_returns_test$Bitcoin, positive= '1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 11  3
##          1 35 46
##                                           
##                Accuracy : 0.6             
##                  95% CI : (0.4944, 0.6992)
##     No Information Rate : 0.5158          
##     P-Value [Acc > NIR] : 0.0614          
##                                           
##                   Kappa : 0.1818          
##                                           
##  Mcnemar's Test P-Value : 4.934e-07       
##                                           
##             Sensitivity : 0.9388          
##             Specificity : 0.2391          
##          Pos Pred Value : 0.5679          
##          Neg Pred Value : 0.7857          
##              Prevalence : 0.5158          
##          Detection Rate : 0.4842          
##    Detection Prevalence : 0.8526          
##       Balanced Accuracy : 0.5890          
##                                           
##        'Positive' Class : 1               
##

Part 3: KNN

Predictor variables data scaling

Next we will try to see if the KNN model can perform better than Logistic Regression for the same issue.

#exp var
weekly_returns_train_x <- weekly_returns_train %>% select_if(is.numeric)
weekly_returns_test_x <- weekly_returns_test %>% 
  select(-Pred) %>% 
  select_if(is.numeric)

# target
weekly_train_y <- weekly_returns_train[,"Bitcoin"]
weekly_test_y <- weekly_returns_test[,"Bitcoin"]

# Scaling
weekly_returns_train_xs <- scale(weekly_returns_train_x)
weekly_returns_test_xs <- scale(weekly_returns_test_x, center = attr(weekly_returns_train_xs, "scaled:center"), 
                                   scale = attr(weekly_returns_train_xs, "scaled:scale"))

Predict

# find optimum k
k <- round(sqrt(nrow(weekly_returns_train_x)),0)
k

## [1] 19

#Predict
library(class)
BTC_knn<-  knn(train=weekly_returns_train_xs,
                  test=weekly_returns_test_xs, 
                  cl=weekly_train_y,
                  k=k)

weekly_returns_test$knn_pred <- BTC_knn
head(weekly_returns_test)

##          Date Halving Bitcoin     Treasury         VIX          DXY
## 3  2014-09-29       0       0 -0.034714026 -0.02020203  0.012260662
## 4  2014-10-06       0       1 -0.057212956  0.45979378 -0.008997563
## 10 2014-11-17       0       0 -0.002155119 -0.03080397  0.008911217
## 13 2014-12-08       0       0 -0.088426529  0.77291836 -0.010858628
## 19 2015-01-19       0       1  0.001101914 -0.20477331  0.026805052
## 21 2015-02-02       0       0  0.157014937 -0.17548872 -0.001054917
##          Nasdaq          Dow        SP500        Gold    BTC_Vol BTC_Vol_Lagged
## 3  -0.008104673 -0.006045698 -0.007539628 -0.01803807  0.4812553      0.1903668
## 4  -0.044547991 -0.027372037 -0.031388800  0.02415706  0.2331215      0.4812553
## 10  0.005210615  0.009941758  0.011608894  0.01054852 -0.1443458      0.7829721
## 13 -0.026598213 -0.037750818 -0.035193800  0.02680449  0.7570012     -0.3852450
## 19  0.026648657  0.009195594  0.016044222  0.01229536 -0.4790440      1.2108433
## 21  0.023549948  0.038411989  0.030315932 -0.03488461 -0.4013463      0.6220743
##         Pred Pred_label knn_pred
## 3  0.5408670          1        1
## 4  0.4592106          0        0
## 10 0.5703891          1        1
## 13 0.4994285          0        1
## 19 0.6168046          1        1
## 21 0.6102048          1        1

confusionMatrix(data=weekly_returns_test$knn_pred,
                reference=weekly_returns_test$Bitcoin,
                positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 17 10
##          1 29 39
##                                           
##                Accuracy : 0.5895          
##                  95% CI : (0.4838, 0.6894)
##     No Information Rate : 0.5158          
##     P-Value [Acc > NIR] : 0.090742        
##                                           
##                   Kappa : 0.1676          
##                                           
##  Mcnemar's Test P-Value : 0.003948        
##                                           
##             Sensitivity : 0.7959          
##             Specificity : 0.3696          
##          Pos Pred Value : 0.5735          
##          Neg Pred Value : 0.6296          
##              Prevalence : 0.5158          
##          Detection Rate : 0.4105          
##    Detection Prevalence : 0.7158          
##       Balanced Accuracy : 0.5827          
##                                           
##        'Positive' Class : 1               
##

Model Comparison and Conclusion

For this exercise, while the results are not satisfactory, logistic regression model performs slightly better at accuracy of 60% in comparison to KNN model. In conclusion, for Bitcoin leverage trading, monitoring Nasdaq will provide traders with slight advantage against the other uninformed traders.