1. Abstract

As Wall Street giants, retail investors, and aspiring cryptocurrency trailblazers continue to flood the cryptocurrency market, the ability to anticipate the movement of cryptocurrency prices has proven increasingly valuable. In this report, we detail a methodology that applies statistical machine learning techniques to predict the direction of Bitcoin's price, building on previous research that anticipates cryptocurrency trends with similar methods. To predict whether Bitcoin's price will move up or down on a given day, we employ Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbors (KNN), Logistic Regression, Decision Trees, and Random Forests. We also perform statistical analyses of our dataset using simple linear regression, multiple linear regression, and summary statistics, and visualize these metrics to gain a more comprehensive understanding of the variables that may help determine the direction of Bitcoin's price on a given day. Finally, we analyze our results and discuss opportunities for future work and research.

2. Load the Libraries

library(tidyverse)   # data wrangling and ggplot2 graphics
library(tidymodels)  # modeling framework (parsnip, recipes, tune, yardstick)
library(discrim)     # parsnip engines for discriminant analysis (LDA/QDA)
library(corrplot)    # correlation matrix visualization
library(rpart.plot)  # decision tree plots
library(vip)         # variable importance plots
library(ranger)      # random forest engine

3. Bitcoin Data

Data Tidying and Cleaning

The data spans the period from 2013-10-01 to 2021-06-13.

BTC <- read_csv("./data/BTC_USD_2013-10-01_2021-06-13-CoinDesk.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Currency = col_character(),
##   Date = col_date(format = ""),
##   `Closing Price (USD)` = col_double(),
##   `24h Open (USD)` = col_double(),
##   `24h High (USD)` = col_double(),
##   `24h Low (USD)` = col_double()
## )
# Rename the price columns, then compute the percentage return for each of the
# previous five days (Lag1-Lag5), today's return, and the Up/Down direction.
BTC %>%
  rename(Closing_Price = "Closing Price (USD)",
         Open_Price = "24h Open (USD)") %>%
  mutate(Lag1 = (lag(Closing_Price, n = 1) - lag(Open_Price, n = 1)) / lag(Open_Price, n = 1)*100,
         Lag2 = (lag(Closing_Price, n = 2) - lag(Open_Price, n = 2)) / lag(Open_Price, n = 2)*100,
         Lag3 = (lag(Closing_Price, n = 3) - lag(Open_Price, n = 3)) / lag(Open_Price, n = 3)*100,
         Lag4 = (lag(Closing_Price, n = 4) - lag(Open_Price, n = 4)) / lag(Open_Price, n = 4)*100,
         Lag5 = (lag(Closing_Price, n = 5) - lag(Open_Price, n = 5)) / lag(Open_Price, n = 5)*100,
         Return_Today = (Closing_Price - Open_Price) / Open_Price*100,
         Direction_Today = ifelse(Return_Today > 0, "Up", "Down"),
         Direction_Today = as.factor(Direction_Today)
         ) -> BTC_tidied 
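
As a quick sanity check, we can confirm that Lag1 is simply the previous day's Return_Today (a minimal sketch; near() is dplyr's tolerance-aware equality test):

# Lag1 should equal yesterday's Return_Today by construction.
BTC_tidied %>%
  mutate(check = near(Lag1, lag(Return_Today))) %>%
  summarize(lag1_matches_previous_return = all(check, na.rm = TRUE))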

names(BTC_tidied)
##  [1] "Currency"        "Date"            "Closing_Price"   "Open_Price"     
##  [5] "24h High (USD)"  "24h Low (USD)"   "Lag1"            "Lag2"           
##  [9] "Lag3"            "Lag4"            "Lag5"            "Return_Today"   
## [13] "Direction_Today"
head(BTC_tidied)
## # A tibble: 6 x 13
##   Currency Date       Closing_Price Open_Price `24h High (USD)` `24h Low (USD)`
##   <chr>    <date>             <dbl>      <dbl>            <dbl>           <dbl>
## 1 BTC      2013-10-01          124.       124.             125.           123. 
## 2 BTC      2013-10-02          125.       124.             126.           124. 
## 3 BTC      2013-10-03          109.       125.             126.            83.3
## 4 BTC      2013-10-04          119.       109.             119.           107. 
## 5 BTC      2013-10-05          121.       119.             122.           118. 
## 6 BTC      2013-10-06          121.       121.             122.           121. 
## # … with 7 more variables: Lag1 <dbl>, Lag2 <dbl>, Lag3 <dbl>, Lag4 <dbl>,
## #   Lag5 <dbl>, Return_Today <dbl>, Direction_Today <fct>
tail(BTC_tidied)
## # A tibble: 6 x 13
##   Currency Date       Closing_Price Open_Price `24h High (USD)` `24h Low (USD)`
##   <chr>    <date>             <dbl>      <dbl>            <dbl>           <dbl>
## 1 BTC      2021-06-08        33504.     33598.           34080.          31035.
## 2 BTC      2021-06-09        37127.     33413.           37332.          32437.
## 3 BTC      2021-06-10        36715.     37404.           38461.          35828.
## 4 BTC      2021-06-11        37342.     36695.           37587.          35982.
## 5 BTC      2021-06-12        35655.     37339.           37460.          34682.
## 6 BTC      2021-06-13        38895.     35568.           39296.          34823.
## # … with 7 more variables: Lag1 <dbl>, Lag2 <dbl>, Lag3 <dbl>, Lag4 <dbl>,
## #   Lag5 <dbl>, Return_Today <dbl>, Direction_Today <fct>

Data Dictionary

Variable Class Description
Currency character Bitcoin or BTC
Date date From 2013-10-01 to 2021-06-13
Closing_Price double The price at the end of a trading day (24hr)
Open_Price double The price at the beginning of a trading day (24hr)
24h High (USD) double The day’s highest price
24h Low (USD) double The day’s lowest price
Lag1 double Percentage return for previous day
Lag2 double Percentage return for 2 days previous
Lag3 double Percentage return for 3 days previous
Lag4 double Percentage return for 4 days previous
Lag5 double Percentage return for 5 days previous
Return_Today double Percentage return for today
Direction_Today factor A factor with levels Down and Up indicating whether the market had a positive or negative return on a given day

4. Data Analysis

Exploratory Data Analysis

Drop NAs: creating Lag1 through Lag5 leaves the first five observations with missing values, so we remove those rows.

BTC_tidied %>%
  drop_na() -> BTC_tidied

We can see that the vast majority of variables are numerical.

str(BTC_tidied)
## tibble [2,808 × 13] (S3: tbl_df/tbl/data.frame)
##  $ Currency       : chr [1:2808] "BTC" "BTC" "BTC" "BTC" ...
##  $ Date           : Date[1:2808], format: "2013-10-06" "2013-10-07" ...
##  $ Closing_Price  : num [1:2808] 121 122 123 124 126 ...
##  $ Open_Price     : num [1:2808] 121 121 122 123 124 ...
##  $ 24h High (USD) : num [1:2808] 122 122 124 125 128 ...
##  $ 24h Low (USD)  : num [1:2808] 121 120 121 123 124 ...
##  $ Lag1           : num [1:2808] 2.245 -0.563 0.945 1.016 0.826 ...
##  $ Lag2           : num [1:2808] 9.292 2.245 -0.563 0.945 1.016 ...
##  $ Lag3           : num [1:2808] -13.447 9.292 2.245 -0.563 0.945 ...
##  $ Lag4           : num [1:2808] 1.456 -13.447 9.292 2.245 -0.563 ...
##  $ Lag5           : num [1:2808] -0.523 1.456 -13.447 9.292 2.245 ...
##  $ Return_Today   : num [1:2808] -0.563 0.945 1.016 0.826 1.541 ...
##  $ Direction_Today: Factor w/ 2 levels "Down","Up": 1 2 2 2 2 1 2 2 2 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Currency = col_character(),
##   ..   Date = col_date(format = ""),
##   ..   `Closing Price (USD)` = col_double(),
##   ..   `24h Open (USD)` = col_double(),
##   ..   `24h High (USD)` = col_double(),
##   ..   `24h Low (USD)` = col_double()
##   .. )

There are 2808 observations and 13 variables in the data set now.

dim(BTC_tidied)
## [1] 2808   13
summary(BTC_tidied)
##    Currency              Date            Closing_Price       Open_Price     
##  Length:2808        Min.   :2013-10-06   Min.   :  120.7   Min.   :  120.7  
##  Class :character   1st Qu.:2015-09-07   1st Qu.:  468.4   1st Qu.:  467.7  
##  Mode  :character   Median :2017-08-09   Median : 3259.1   Median : 3233.5  
##                     Mean   :2017-08-09   Mean   : 6834.6   Mean   : 6820.8  
##                     3rd Qu.:2019-07-12   3rd Qu.: 8697.1   3rd Qu.: 8694.5  
##                     Max.   :2021-06-13   Max.   :63346.8   Max.   :63562.7  
##  24h High (USD)    24h Low (USD)          Lag1               Lag2         
##  Min.   :  121.8   Min.   :  120.4   Min.   :-27.0821   Min.   :-27.0821  
##  1st Qu.:  478.0   1st Qu.:  454.4   1st Qu.: -1.3074   1st Qu.: -1.3037  
##  Median : 3343.4   Median : 3157.2   Median :  0.1556   Median :  0.1559  
##  Mean   : 7032.0   Mean   : 6588.5   Mean   :  0.2964   Mean   :  0.3013  
##  3rd Qu.: 8905.4   3rd Qu.: 8381.3   3rd Qu.:  1.9602   3rd Qu.:  1.9616  
##  Max.   :64801.8   Max.   :62094.6   Max.   : 35.8493   Max.   : 35.8493  
##       Lag3               Lag4               Lag5           Return_Today     
##  Min.   :-27.0821   Min.   :-27.0821   Min.   :-27.0821   Min.   :-27.0821  
##  1st Qu.: -1.3074   1st Qu.: -1.3037   1st Qu.: -1.3037   1st Qu.: -1.3074  
##  Median :  0.1556   Median :  0.1559   Median :  0.1556   Median :  0.1556  
##  Mean   :  0.2959   Mean   :  0.2970   Mean   :  0.2929   Mean   :  0.2989  
##  3rd Qu.:  1.9616   3rd Qu.:  1.9616   3rd Qu.:  1.9602   3rd Qu.:  1.9602  
##  Max.   : 35.8493   Max.   : 35.8493   Max.   : 35.8493   Max.   : 35.8493  
##  Direction_Today
##  Down:1311      
##  Up  :1497      
##                 
##                 
##                 
## 

To examine the relationships among the remaining variables, we compute a correlation matrix after dropping the non-numeric columns (Currency, Date, and the factor Direction_Today).

correlation <- cor(BTC_tidied[, c(-1, -2, -13)])
correlation
##                Closing_Price  Open_Price 24h High (USD) 24h Low (USD)
## Closing_Price     1.00000000  0.99880186   0.9994734952   0.999279990
## Open_Price        0.99880186  1.00000000   0.9994306812   0.998847573
## 24h High (USD)    0.99947350  0.99943068   1.0000000000   0.998772826
## 24h Low (USD)     0.99927999  0.99884757   0.9987728256   1.000000000
## Lag1              0.00976569  0.01125970   0.0096635511   0.013361256
## Lag2              0.01159900  0.01016470   0.0100721317   0.012337575
## Lag3              0.01262511  0.01219673   0.0121812074   0.012012892
## Lag4              0.01400713  0.01310883   0.0128781730   0.014078814
## Lag5              0.01130891  0.01132739   0.0114678761   0.010566244
## Return_Today      0.01284739 -0.01394151   0.0003989676   0.002880128
##                        Lag1         Lag2         Lag3         Lag4         Lag5
## Closing_Price   0.009765690  0.011598997  0.012625107  0.014007131  0.011308910
## Open_Price      0.011259704  0.010164696  0.012196728  0.013108828  0.011327388
## 24h High (USD)  0.009663551  0.010072132  0.012181207  0.012878173  0.011467876
## 24h Low (USD)   0.013361256  0.012337575  0.012012892  0.014078814  0.010566244
## Lag1            1.000000000 -0.021338247 -0.009811714  0.015743597  0.044364669
## Lag2           -0.021338247  1.000000000 -0.023576098 -0.009803539  0.016631003
## Lag3           -0.009811714 -0.023576098  1.000000000 -0.023784649 -0.009886963
## Lag4            0.015743597 -0.009803539 -0.023784649  1.000000000 -0.023377327
## Lag5            0.044364669  0.016631003 -0.009886963 -0.023377327  1.000000000
## Return_Today   -0.022535226 -0.009035233  0.015306806  0.046234701  0.045251589
##                 Return_Today
## Closing_Price   0.0128473939
## Open_Price     -0.0139415089
## 24h High (USD)  0.0003989676
## 24h Low (USD)   0.0028801285
## Lag1           -0.0225352257
## Lag2           -0.0090352326
## Lag3            0.0153068056
## Lag4            0.0462347005
## Lag5            0.0452515893
## Return_Today    1.0000000000

5. Data Visualization

Correlation Table

When the correlations are visualized, they are much easier to read. Closing_Price, Open_Price, 24h High (USD), and 24h Low (USD) are all very highly correlated with one another. And because Direction_Today is derived directly from the sign of Return_Today, using Return_Today as a predictor would make no sense: it would leak the response into the model. Thus, we keep only Lag1, Lag2, Lag3, Lag4, and Lag5 as predictors in the initial stage.

corrplot(correlation, method = "color", addCoef.col = "black", 
         number.cex = 0.7)

Bitcoin to the moon?

The scatterplot created using the ggplot function allows us to review the relationship between the Closing_Price variable and the Date variable.

BTC_tidied %>%
  ggplot(aes(Date, Closing_Price)) +
  geom_point(color = "blue") +
  theme_bw()

Bitcoin to the moon (with Opening Price)

Similar to the scatterplot above, this ggplot scatterplot lets us review the relationship between the Open_Price variable and the Date variable.

BTC_tidied %>%
  ggplot(aes(Date, Open_Price)) +
  geom_point(color = "green") +
  theme_bw()

When we compare the two scatterplots, the trends align almost exactly across dates: because a day's price move is small relative to the multi-year range of the series, the opening and closing prices track each other closely, so days that open lower also tend to close lower, and vice versa. We can also observe three pronounced peaks in the opening and closing prices, around early 2018, mid-2019, and the first half of 2021.
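
To compare the two series directly rather than across separate plots, one option (a sketch using tidyr's pivot_longer(), already loaded via the tidyverse) is to overlay them:

# Overlay opening and closing prices on one plot for direct comparison.
BTC_tidied %>%
  select(Date, Open_Price, Closing_Price) %>%
  pivot_longer(-Date, names_to = "Series", values_to = "Price") %>%
  ggplot(aes(Date, Price, color = Series)) +
  geom_line(alpha = 0.7) +
  theme_bw()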

Variation of Date vs Return Today

Using the ggplot function again, we can draw a bar graph of the Return_Today variable against the Date variable to better understand the relationship between the two:

ggplot(data = BTC_tidied) +
  geom_bar(mapping = aes(x = Date, y = Return_Today, fill = Date), stat = "identity")

The plot shows daily returns oscillating between gains and losses throughout the entire period, with no persistent directional trend. Some swings are far larger than others, so while the sign of the daily return flips frequently, its magnitude varies considerably over time.
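
To quantify that variability, we can summarize the spread of daily returns by calendar year (a sketch; lubridate is assumed installed, as it ships with the tidyverse):

# Mean and standard deviation of daily returns for each year.
BTC_tidied %>%
  group_by(Year = lubridate::year(Date)) %>%
  summarize(Mean_Return = mean(Return_Today),
            SD_Return = sd(Return_Today),
            Days = n())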

Frequency of Ups vs Downs in a Pie Chart

Here, we create a pie chart comparing the overall balance of Up and Down days for Bitcoin:

# Note: with stat = "identity", each slice is weighted by the sum of
# Return_Today within each direction, not by the raw count of days.
cd <- ggplot(BTC_tidied, aes(x = "", y = Return_Today, fill = Direction_Today)) +
  geom_bar(width = .5, stat = "identity") 
pie <- cd + coord_polar("y", start = 0) 
pie

Based on this pie chart, we see that while the split is close to 50/50, slightly more of the chart is Up than Down, consistent with the summary statistics above: 1,497 Up days versus 1,311 Down days.
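
Because stat = "identity" weights each slice by the sum of Return_Today rather than by the number of days, a direct tally is a useful cross-check (it should match the counts in the summary above):

# Count the Up and Down days directly.
BTC_tidied %>%
  count(Direction_Today)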

6. Decision Tree and Random Forest

Decision Trees

Decision Tree Model 1

Here, we create a decision tree model using the default hyperparameters. We begin by setting a seed (before the split, so the split is reproducible) and then creating the initial split, training set, and test set:

set.seed(1234)   # set the seed before splitting so the split is reproducible
bc_split <- initial_split(BTC_tidied)
bc_train <- training(bc_split)
bc_test <- testing(bc_split)

Now, we create the decision tree model specification by setting the engine equal to rpart and the mode equal to classification:

decision_tree_rpart_spec <- 
  decision_tree() %>%
  set_engine('rpart') %>%
  set_mode('classification')

Next, we fit the decision tree model against our training data set:

dt_fit <- fit(decision_tree_rpart_spec, Direction_Today ~., data = bc_train)
dt_fit
## parsnip model object
## 
## Fit time:  28ms 
## n= 2106 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 2106 961 Up (0.4563153 0.5436847)  
##   2) Return_Today< -0.0002054209 961   0 Down (1.0000000 0.0000000) *
##   3) Return_Today>=-0.0002054209 1145   0 Up (0.0000000 1.0000000) *

Finally, we plot our decision tree:

rpart.plot(dt_fit$fit)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Here, we see that the decision tree makes a single split: days with Return_Today below about -2.05e-4 are classified as Down, and days above that threshold as Up. This perfect split is unsurprising, and is really a form of data leakage: Direction_Today is defined directly from the sign of Return_Today, so including Return_Today as a predictor hands the model the answer. The root-node proportions also match our pie chart, with slightly more Up days (54%) than Down days (46%) in the training set.
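
A leakage-free variant would restrict the tree to information available before the trading day. A sketch, reusing the same specification with only the lagged returns as predictors (this predictor choice is our assumption, mirroring the EDA):

# Refit the tree without Return_Today (and the same-day prices),
# so the model cannot read the answer off its own definition.
dt_fit_lags <- fit(decision_tree_rpart_spec,
                   Direction_Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5,
                   data = bc_train)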

Decision Tree Model 2: Hyperparameter

Now, we create a second decision tree variation by setting the cost_complexity hyperparameter to 0.5. We then fit this specification to the training data, just as we did with the defaults:

decision_tree_rpart_spec_lambda <-
  decision_tree(cost_complexity = 0.5) %>%
  set_engine('rpart') %>%
  set_mode('classification')
  
dt_fit_lambda2 <- fit(decision_tree_rpart_spec_lambda, Direction_Today ~., data = bc_train)
dt_fit_lambda2
## parsnip model object
## 
## Fit time:  17ms 
## n= 2106 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 2106 961 Up (0.4563153 0.5436847)  
##   2) Return_Today< -0.0002054209 961   0 Down (1.0000000 0.0000000) *
##   3) Return_Today>=-0.0002054209 1145   0 Up (0.0000000 1.0000000) *

Next, we plot our decision tree:

rpart.plot(dt_fit_lambda2$fit)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Based on this plot, we see the same single split as in the first model: values of Return_Today below about -2.05e-4 are classified as Down, and values above it as Up, with the same 54%/46% Up/Down proportions. Setting cost_complexity to 0.5 did not change the tree: the Return_Today split is so informative (it eliminates all training error) that it survives even this heavy complexity penalty.

Decision Tree Model 3

To create a third decision tree variation, we set cost_complexity to 3 and again fit the specification to the training data:

decision_tree_rpart_spec_lambda <-
  decision_tree(cost_complexity = 3) %>%
  set_engine('rpart') %>%
  set_mode('classification')
  
dt_fit_lambda3 <- fit(decision_tree_rpart_spec_lambda, Direction_Today ~., data = bc_train)
dt_fit_lambda3
## parsnip model object
## 
## Fit time:  7ms 
## n= 2106 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 2106 961 Up (0.4563153 0.5436847) *

Then, we plot our decision tree model with tuned hyperparameters:

rpart.plot(dt_fit_lambda3$fit)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Here, it is clear that setting cost_complexity to 3 dramatically affected the result. The tree has been pruned all the way back to the root node, which simply predicts the majority class, Up (54% of training days). With a penalty this large, no split improves the penalized error enough to be retained, so this specification fails to yield a model that meaningfully predicts the direction of Bitcoin's price.

Decision Tree Model 4

To create our fourth decision tree variation, we set cost_complexity to a more moderate value of 1.5 and fit the specification to our training set:

decision_tree_rpart_spec_lambda <-
  decision_tree(cost_complexity = 1.5) %>%
  set_engine('rpart') %>%
  set_mode('classification')
  
dt_fit_lambda4 <- fit(decision_tree_rpart_spec_lambda, Direction_Today ~., data = bc_train)
dt_fit_lambda4
## parsnip model object
## 
## Fit time:  8ms 
## n= 2106 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 2106 961 Up (0.4563153 0.5436847) *

Once again, we plot our adjusted decision tree model variation:

rpart.plot(dt_fit_lambda4$fit)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

As with cost_complexity = 3, the tree is pruned all the way back to the root node, so only the majority class, Up (54%), is predicted. That this plot matches the third model's is expected rather than surprising: any cost_complexity value larger than the improvement contributed by the best split prunes that split away, so 1.5 and 3 produce the same root-only tree, and neither yields a model that meaningfully predicts the direction of Bitcoin's price on a given day. A more informative approach is to search over small cost_complexity values, as sketched below.
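
Rather than hand-picking values like 0.5, 1.5, or 3, cost_complexity could be tuned over a grid of small values, where pruning decisions actually vary, mirroring the KNN tuning later in this report. A sketch, assuming 10-fold cross-validation on the training set and the lag-only formula to avoid leakage:

# Tune cost_complexity over a grid of small values.
dt_tune_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

dt_wf <- workflow() %>%
  add_formula(Direction_Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5) %>%
  add_model(dt_tune_spec)

set.seed(2021)
dt_res <- tune_grid(dt_wf,
                    resamples = vfold_cv(bc_train, v = 10),
                    grid = grid_regular(cost_complexity(), levels = 10))
show_best(dt_res, metric = "accuracy")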

Now that we’ve run four variations of the decision tree model, we can use the vi() function to find the variable importance for the first decision tree model that we created because it appears to have one of the best fits with the default hyperparameters:

vi(dt_fit)
## # A tibble: 6 x 2
##   Variable     Importance
##   <chr>             <dbl>
## 1 Return_Today    1045.  
## 2 Date              20.7 
## 3 Lag1              14.1 
## 4 Open_Price         9.79
## 5 Lag2               5.44
## 6 Lag5               5.44

The table shows that Return_Today ranks far above every other variable in importance.

Now, we use the vip() function to plot our variable importance for the first decision tree model that we created:

vip(dt_fit)

The variable importance plot aligns with the numerical values generated above by the vi() function: Return_Today remains the most important variable by a wide margin.

Random Forest

Now, we run a random forest model. To create the random forest model specification, we set the engine equal to ranger, importance equal to impurity, and the mode equal to classification:

rand_forest_ranger_spec <- 
  rand_forest() %>%
  set_engine('ranger', importance = "impurity") %>%
  set_mode('classification')
rand_forest_ranger_spec
## Random Forest Model Specification (classification)
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger

Next, we set the seed and fit the random forest model specification against our training data set.

set.seed(4321)
rf_fit <- fit(rand_forest_ranger_spec, Direction_Today ~., data = bc_train)
rf_fit
## parsnip model object
## 
## Fit time:  462ms 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      2106 
## Number of independent variables:  12 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.001833082

We can then use the vi() function to find the variable importance.

vi(rf_fit$fit)
## # A tibble: 12 x 2
##    Variable       Importance
##    <chr>               <dbl>
##  1 Return_Today       956.  
##  2 Lag1                10.2 
##  3 Date                10.1 
##  4 Open_Price           9.97
##  5 Closing_Price        7.64
##  6 Lag3                 7.42
##  7 Lag2                 7.32
##  8 Lag5                 6.82
##  9 Lag4                 6.75
## 10 24h Low (USD)        6.01
## 11 24h High (USD)       5.55
## 12 Currency             0

Here, we see that for the random forest model, Return_Today is again by far the most important variable, which reflects the same leakage noted for the decision trees: Direction_Today is derived directly from Return_Today.

Finally, we can plot the variable importance of our random forest model:

vip(rf_fit$fit)

In doing so, we observe that the Return_Today variable is ranked with the highest importance for the random forest model.
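
The classification models later in this report are evaluated on the held-out test set; the same check could be applied to the random forest here (a sketch using yardstick's accuracy(), with the caveat that Return_Today is among the predictors, so near-perfect accuracy would again reflect leakage):

# Evaluate the random forest on the test set.
augment(rf_fit, new_data = bc_test) %>%
  accuracy(truth = Direction_Today, estimate = .pred_class)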

7. Data Modeling

Before classifying Direction_Today, we can use linear regression models to examine the relationship between Return_Today and Lag1 through Lag5, concentrating on the fitted coefficients.

Multiple Linear Regression Model

Here, we use the Return_Today variable as our response variable and the five Lag variables as our predictors to create our multiple regression model:

btc_mlc <- lm(Return_Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data = BTC_tidied)
btc_mlc
## 
## Call:
## lm(formula = Return_Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, 
##     data = BTC_tidied)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##    0.276195    -0.025475    -0.009517     0.016422     0.047975     0.047803

In doing so, we obtain the coefficients for each of the predictor variables and the intercept. We can then use the summary() function to get a better understanding of which predictor variables are significant and how well our model fits our dataset.

summary(btc_mlc)
## 
## Call:
## lm(formula = Return_Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, 
##     data = BTC_tidied)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.785  -1.623  -0.101   1.653  35.313 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.276195   0.081172   3.403 0.000677 ***
## Lag1        -0.025475   0.018883  -1.349 0.177409    
## Lag2        -0.009517   0.018859  -0.505 0.613849    
## Lag3         0.016422   0.018823   0.872 0.383055    
## Lag4         0.047975   0.018825   2.548 0.010874 *  
## Lag5         0.047803   0.018860   2.535 0.011313 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.249 on 2802 degrees of freedom
## Multiple R-squared:  0.005297,   Adjusted R-squared:  0.003522 
## F-statistic: 2.984 on 5 and 2802 DF,  p-value: 0.01084

With a multiple R-squared of 0.005297 and an adjusted R-squared of 0.003522, the model explains almost none of the variability in Return_Today. However, the overall p-value of 0.01084 is below the standard 0.05 significance level, so we can reject the null hypothesis that all coefficients are zero: the model is statistically significant even though its explanatory power is weak. Only two predictors, Lag4 and Lag5, have p-values below 0.05 and are individually significant.
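
These fit statistics can also be extracted programmatically (a sketch using broom::glance(), loaded with the tidymodels meta-package; the column names below are broom's own):

# Pull out R-squared, adjusted R-squared, and the overall F-test p-value.
broom::glance(btc_mlc) %>%
  select(r.squared, adj.r.squared, statistic, p.value)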

We can also use the plot() function to further analyze how well our model performs on our data set:

par(mfrow = c(2, 2))
plot(btc_mlc)

Overall, the four diagnostic plots suggest the model is an adequate, if unremarkable, fit. The Scale-Location and Residuals vs. Leverage plots show some unequal variance and a few influential points, but the residuals are centered around zero in the Residuals vs. Fitted plot, and the Q-Q plot indicates they are fairly normally distributed.

Simple Linear Regression Model 1

Now, we evaluate our dataset by running simple linear regression models between each of our predictor variables and our response variable one at a time. Here, we begin with the Lag1 predictor variable and the Return_Today response variable:

btc_slr <- lm(Return_Today ~ Lag1, data = BTC_tidied)
btc_slr
## 
## Call:
## lm(formula = Return_Today ~ Lag1, data = BTC_tidied)
## 
## Coefficients:
## (Intercept)         Lag1  
##     0.30557     -0.02255

Having obtained the coefficients for this first simple linear regression model, we apply the summary() function to better evaluate how well the model fits our data set:

summary(btc_slr)
## 
## Call:
## lm(formula = Return_Today ~ Lag1, data = BTC_tidied)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.389  -1.613  -0.144   1.649  35.922 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.30557    0.08052   3.795 0.000151 ***
## Lag1        -0.02255    0.01889  -1.194 0.232566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.256 on 2806 degrees of freedom
## Multiple R-squared:  0.0005078,  Adjusted R-squared:  0.0001516 
## F-statistic: 1.426 on 1 and 2806 DF,  p-value: 0.2326

With an adjusted R-squared of 0.0001516 and a multiple R-squared of 0.0005078, this model explains essentially none of the variability in Return_Today. The overall p-value of 0.2326 exceeds the 0.05 significance level, so we fail to reject the null hypothesis: the model is not significant, and neither is the Lag1 coefficient (p = 0.232566).

Simple Linear Regression Model 2

To create our second simple linear regression model, we use Lag2 as our predictor variable and maintain our response variable as Return_Today:

btc_slr2 <- lm(Return_Today ~ Lag2, data = BTC_tidied)
btc_slr2
## 
## Call:
## lm(formula = Return_Today ~ Lag2, data = BTC_tidied)
## 
## Coefficients:
## (Intercept)         Lag2  
##    0.301610    -0.009037

Having obtained the coefficients for this second simple linear regression model, we again apply the summary() function:

summary(btc_slr2)
## 
## Call:
## lm(formula = Return_Today ~ Lag2, data = BTC_tidied)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.381  -1.619  -0.146   1.668  35.584 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.301610   0.080539   3.745 0.000184 ***
## Lag2        -0.009037   0.018881  -0.479 0.632238    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.257 on 2806 degrees of freedom
## Multiple R-squared:  8.164e-05,  Adjusted R-squared:  -0.0002747 
## F-statistic: 0.2291 on 1 and 2806 DF,  p-value: 0.6322

With an adjusted R-squared of -0.0002747 and a multiple R-squared of 8.164e-05, this model likewise explains essentially none of the variability in the response. The overall p-value of 0.6322 is well above 0.05, so neither the model nor the Lag2 coefficient is significant.

Simple Linear Regression Model 3

Now, we create our third simple linear regression model by using the Lag3 variable as our predictor and maintain the Return_Today variable as our response:

btc_slr3 <- lm(Return_Today ~ Lag3, data = BTC_tidied)
btc_slr3
## 
## Call:
## lm(formula = Return_Today ~ Lag3, data = BTC_tidied)
## 
## Coefficients:
## (Intercept)         Lag3  
##     0.29437      0.01528

Having obtained the coefficients for this third simple linear regression model, we apply the summary() function once more:

summary(btc_slr3)
## 
## Call:
## lm(formula = Return_Today ~ Lag3, data = BTC_tidied)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.322  -1.607  -0.163   1.651  35.513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.29437    0.08052   3.656 0.000261 ***
## Lag3         0.01528    0.01884   0.811 0.417479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.257 on 2806 degrees of freedom
## Multiple R-squared:  0.0002343,  Adjusted R-squared:  -0.000122 
## F-statistic: 0.6576 on 1 and 2806 DF,  p-value: 0.4175

With an adjusted R-squared of -0.000122 and a multiple R-squared of 0.0002343, the story is the same: the overall p-value of 0.4175 is above 0.05, so neither the model nor the Lag3 coefficient (p = 0.417479) is significant.

Simple Linear Regression Model 4

Here, we use the Lag4 variable as our predictor, once again keeping Return_Today as our response:

btc_slr4 <- lm(Return_Today ~ Lag4, data = BTC_tidied)
btc_slr4
## 
## Call:
## lm(formula = Return_Today ~ Lag4, data = BTC_tidied)
## 
## Coefficients:
## (Intercept)         Lag4  
##     0.28518      0.04616

Having obtained the coefficients for this fourth simple linear regression model, we apply the summary() function:

summary(btc_slr4)
## 
## Call:
## lm(formula = Return_Today ~ Lag4, data = BTC_tidied)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.975  -1.644  -0.108   1.623  35.494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.28518    0.08045   3.545 0.000399 ***
## Lag4         0.04616    0.01883   2.452 0.014277 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.253 on 2806 degrees of freedom
## Multiple R-squared:  0.002138,   Adjusted R-squared:  0.001782 
## F-statistic: 6.011 on 1 and 2806 DF,  p-value: 0.01428

With an adjusted R-squared of 0.001782 and a multiple R-squared of 0.002138, explanatory power remains minimal, but the overall p-value of 0.01428 is below 0.05: the model is statistically significant, and the Lag4 coefficient (p = 0.014277) is significant as well.

Simple Linear Regression Model 5

Finally, we can fit our fifth simple linear regression model by using Lag5 as our predictor variable and maintain Return_Today as our response variable:

btc_slr5 <- lm(Return_Today ~ Lag5, data = BTC_tidied)
btc_slr5
## 
## Call:
## lm(formula = Return_Today ~ Lag5, data = BTC_tidied)
## 
## Coefficients:
## (Intercept)         Lag5  
##     0.28564      0.04523

Having obtained the coefficients for this fifth and final simple linear regression model, we apply the summary() function:

summary(btc_slr5)
## 
## Call:
## lm(formula = Return_Today ~ Lag5, data = BTC_tidied)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.268  -1.670  -0.131   1.684  34.988 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.28564    0.08045   3.551 0.000391 ***
## Lag5         0.04523    0.01885   2.400 0.016482 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.253 on 2806 degrees of freedom
## Multiple R-squared:  0.002048,   Adjusted R-squared:  0.001692 
## F-statistic: 5.758 on 1 and 2806 DF,  p-value: 0.01648

With an adjusted R-squared of 0.001692 and a multiple R-squared of 0.002048, explanatory power is again minimal, but the overall p-value of 0.01648 is below 0.05: both the model and the Lag5 coefficient (p = 0.016482) are significant.

Overall, the fourth and fifth simple linear regression models performed best, as they were the only significant ones; even so, their R-squared values are tiny, so there is plenty of room for better-fitting models. Each simple model is significant exactly when its lone predictor is significant, which is expected: with a single predictor, the overall F-test is equivalent to the t-test on that predictor's slope. The five fits are compared side by side below.
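
A compact way to see all five fits together (a sketch using purrr::map_dfr() and broom::glance(); reformulate() builds each one-predictor formula):

# Fit each simple regression and collect its fit statistics.
purrr::map_dfr(paste0("Lag", 1:5), function(l) {
  fit <- lm(reformulate(l, response = "Return_Today"), data = BTC_tidied)
  broom::glance(fit) %>%
    dplyr::transmute(predictor = l, r.squared, adj.r.squared, p.value)
})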

Data Splitting

Now, we will use logistic regression, LDA, QDA, and KNN (tuning the number of neighbors) to predict Direction_Today, and compare the models by test-set accuracy.

set.seed(1000)
BTC_split <- initial_split(BTC_tidied)
BTC_train <- training(BTC_split)
BTC_test  <- testing(BTC_split)

Logistic Regression Model

Logistic Model Specification

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
lr_spec
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

Model Formula

Based on the EDA above, we initially choose Lag1, Lag2, Lag3, Lag4, and Lag5 as predictors.

rec_allpreds <- recipe(Direction_Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data = BTC_train)
  # step_normalize(all_numeric_predictors())
rec_allpreds
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          5

Workflow

lr_wf1 <- workflow() %>%
  add_recipe(rec_allpreds) %>%
  add_model(lr_spec)

Model Fitting

lr_fit1 <- fit(lr_wf1, data = BTC_train)
lr_fit1
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##    0.114906    -0.027337     0.017894     0.020462    -0.005504     0.006499  
## 
## Degrees of Freedom: 2105 Total (i.e. Null);  2100 Residual
## Null Deviance:       2912 
## Residual Deviance: 2898  AIC: 2910

Final Accuracy

augment(lr_fit1, new_data = BTC_test) %>%
  accuracy(truth = Direction_Today, estimate = .pred_class) %>%
  mutate(description = "LR Model with Lag1 to Lag5") -> acc_lr_fit1
acc_lr_fit1
## # A tibble: 1 x 4
##   .metric  .estimator .estimate description               
##   <chr>    <chr>          <dbl> <chr>                     
## 1 accuracy binary         0.540 LR Model with Lag1 to Lag5

Confusion Matrix

augment(lr_fit1, new_data = BTC_test) %>%
  conf_mat(truth = Direction_Today, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Coefficients Table

Only two predictors are significant, and since predictors with high p-values tend to increase variance without a corresponding decrease in bias, we keep only Lag1 and Lag3 as model predictors.

lr_coef <- lr_spec %>%
  fit(Direction_Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data = BTC_train)
lr_coef$fit %>% 
  tidy() %>%
  mutate(sign_level = ifelse(p.value < 0.05, "Yes", "No")) 
## # A tibble: 6 x 6
##   term        estimate std.error statistic p.value sign_level
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>     
## 1 (Intercept)  0.115      0.0444     2.59  0.00968 Yes       
## 2 Lag1        -0.0273     0.0104    -2.63  0.00845 Yes       
## 3 Lag2         0.0179     0.0102     1.75  0.0797  No        
## 4 Lag3         0.0205     0.0103     1.98  0.0480  Yes       
## 5 Lag4        -0.00550    0.0105    -0.526 0.599   No        
## 6 Lag5         0.00650    0.0102     0.634 0.526   No
  # filter(sign_level == "Yes")

Revised Model Formula

We are now only using Lag1 and Lag3 as regressors.

rec <- recipe(Direction_Today ~ Lag1 + Lag3, data = BTC_train) 
  # step_normalize(all_numeric_predictors())
rec 
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2

Revised Workflow

lr_wf2 <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lr_spec)

Model Fitting

lr_fit2 <- fit(lr_wf2, data = BTC_train)
lr_fit2
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag3  
##     0.12063     -0.02748      0.01976  
## 
## Degrees of Freedom: 2105 Total (i.e. Null);  2103 Residual
## Null Deviance:       2912 
## Residual Deviance: 2902  AIC: 2908

Final Accuracy

augment(lr_fit2, new_data = BTC_test) %>%
  accuracy(truth = Direction_Today, estimate = .pred_class) %>%
  mutate(description = "LR Model with Lag1 and Lag3") -> acc_lr_fit2
acc_lr_fit2
## # A tibble: 1 x 4
##   .metric  .estimator .estimate description                
##   <chr>    <chr>          <dbl> <chr>                      
## 1 accuracy binary         0.550 LR Model with Lag1 and Lag3

Confusion Matrix

As shown in the confusion matrix below, the logistic regression model uses the two Lag variables as predictors. The rows of the matrix represent the predictions and the columns represent the ground truth; the top-left cell counts true Downs and the bottom-right cell counts true Ups. The model records 42 true Downs against 39 false Downs, and 344 true Ups against 277 false Ups, giving an accuracy of (42 + 344)/702, or about 55%. Note, however, that it predicts Up on 621 of the 702 test days, so its accuracy largely reflects the base rate of Up days rather than a genuine ability to discriminate between the two directions.

augment(lr_fit2, new_data = BTC_test) %>%
  conf_mat(truth = Direction_Today, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
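
As a cross-check, the accuracy can be recomputed by hand from the four cells quoted above:

# Accuracy = correct predictions / all test observations.
(42 + 344) / (42 + 39 + 277 + 344)   # approximately 0.550, matching acc_lr_fit2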

Linear Discriminant Analysis Model

Linear Discriminant Model Specification

# set LDA specification
lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS") # the default is "MASS", otherwise is "mda"
lda_spec
## Linear Discriminant Model Specification (classification)
## 
## Computational engine: MASS

Workflow

lda_wf1 <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lda_spec)

Model Fitting

lda_fit1 <- fit(lda_wf1, data = BTC_train)
lda_fit1
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: discrim_linear()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## lda(..y ~ ., data = data)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4710351 0.5289649 
## 
## Group means:
##           Lag1      Lag3
## Down 0.6237581 0.1031490
## Up   0.1207130 0.4640411
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.1887633
## Lag3  0.1356100

Final Accuracy

augment(lda_fit1, new_data = BTC_test) %>%
  accuracy(truth = Direction_Today, estimate = .pred_class) %>%
  mutate(description = "LDA Model with Lag1 and Lag3") -> acc_lda_fit1
acc_lda_fit1
## # A tibble: 1 x 4
##   .metric  .estimator .estimate description                 
##   <chr>    <chr>          <dbl> <chr>                       
## 1 accuracy binary         0.547 LDA Model with Lag1 and Lag3

Confusion Matrix

In the LDA model, there are 40 true Downs against 39 false Downs, and 344 true Ups against 279 false Ups, giving an accuracy of (40 + 344)/702, or about 54.7%. Like the logistic model, it predicts Up for the vast majority of test days.

augment(lda_fit1, new_data = BTC_test) %>%
  conf_mat(truth = Direction_Today, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Quadratic Discriminant Analysis Model

Quadratic Discriminant Model Specification

# set QDA specification
qda_spec <- discrim_regularized() %>%
  set_mode("classification") %>%
  set_args(frac_common_cov = 0, frac_identity = 0) %>%
  set_engine("klaR")
qda_spec
## Regularized Discriminant Model Specification (classification)
## 
## Main Arguments:
##   frac_common_cov = 0
##   frac_identity = 0
## 
## Computational engine: klaR

Workflow

qda_wf1 <- workflow() %>%
  add_recipe(rec) %>%
  add_model(qda_spec)

Model Fitting

qda_fit1 <- fit(qda_wf1, data = BTC_train)
qda_fit1
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: discrim_regularized()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Call: 
## rda(formula = ..y ~ ., data = data, lambda = ~0, gamma = ~0)
## 
## Regularization parameters: 
##  gamma lambda 
##      0      0 
## 
## Prior probabilities of groups: 
##      Down        Up 
## 0.4710351 0.5289649 
## 
## Misclassification rate: 
##        apparent: 47.673 %

Final Accuracy

augment(qda_fit1, new_data = BTC_test) %>%
  accuracy(truth = Direction_Today, estimate = .pred_class) %>%
  mutate(description = "QDA Model with Lag1 and Lag3") -> acc_qda_fit1
acc_qda_fit1
## # A tibble: 1 x 4
##   .metric  .estimator .estimate description                 
##   <chr>    <chr>          <dbl> <chr>                       
## 1 accuracy binary         0.524 QDA Model with Lag1 and Lag3

Confusion Matrix

In the QDA model, there are 188 true Downs against 203 false Downs, and 180 true Ups against 131 false Ups, giving an accuracy of (188 + 180)/702, or about 52.4%. Unlike the previous two models, QDA predicts Down slightly more often than Up (391 versus 311 days).

augment(qda_fit1, new_data = BTC_test) %>%
  conf_mat(truth = Direction_Today, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Cross Validation in KNN Model

KNN Model Specification

We will use neighbors = tune() in order to find the optimal K in the KNN model.

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn")
knn_spec
## K-Nearest Neighbor Model Specification (classification)
## 
## Main Arguments:
##   neighbors = tune()
## 
## Computational engine: kknn

Centering and Scaling Formula

Since we are using a K-nearest neighbors model, it is important that the variables be centered and scaled so that each predictor has a uniform influence on the distance calculation. We can accomplish this transformation with step_normalize(), which does centering and scaling in one go.

rec_norm <- recipe(Direction_Today ~ Lag1 + Lag3, data = BTC_train) %>%
  step_normalize(all_numeric_predictors())
rec_norm
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2
## 
## Operations:
## 
## Centering and scaling for all_numeric_predictors()

Workflow

knn_wf1 <- workflow() %>%
  add_recipe(rec_norm) %>%
  add_model(knn_spec)

V-Fold Cross Validation

Here we create the cross-validation resamples for use in tune_grid() below, with the default of 10 folds.

set.seed(1100)
cv10 <- vfold_cv(BTC_train, strata = Direction_Today, v = 10)
# param_grid <- grid_regular(neighbors(), levels = 100)
param_grid <- tibble(neighbors = 1:100)
param_grid
## # A tibble: 100 x 1
##    neighbors
##        <int>
##  1         1
##  2         2
##  3         3
##  4         4
##  5         5
##  6         6
##  7         7
##  8         8
##  9         9
## 10        10
## # … with 90 more rows

Tune the Number of Neighbors

tune_res1 <- tune_grid(
  object = knn_wf1,
  resamples = cv10,
  grid = param_grid, control = control_grid(verbose = TRUE)
)

Visualization

Accuracy is best when the number of neighbors is 1 (red vertical line), while the ROC AUC is best when the number of neighbors is 8 (blue vertical line).

autoplot(tune_res1) +
  geom_vline(xintercept = 1, color = "red") +
  geom_vline(xintercept = 8, color = "blue")

tune_res1 %>%
  show_best(metric = "accuracy")
## # A tibble: 5 x 7
##   neighbors .metric  .estimator  mean     n std_err .config               
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                 
## 1         1 accuracy binary     0.528    10 0.00624 Preprocessor1_Model001
## 2         2 accuracy binary     0.528    10 0.00624 Preprocessor1_Model002
## 3         3 accuracy binary     0.528    10 0.00624 Preprocessor1_Model003
## 4         4 accuracy binary     0.527    10 0.0119  Preprocessor1_Model004
## 5         5 accuracy binary     0.526    10 0.0123  Preprocessor1_Model005
tune_res1 %>%
  show_best(metric = "roc_auc")
## # A tibble: 5 x 7
##   neighbors .metric .estimator  mean     n std_err .config               
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
## 1         8 roc_auc binary     0.532    10  0.0118 Preprocessor1_Model008
## 2         5 roc_auc binary     0.532    10  0.0119 Preprocessor1_Model005
## 3         7 roc_auc binary     0.532    10  0.0124 Preprocessor1_Model007
## 4         6 roc_auc binary     0.531    10  0.0127 Preprocessor1_Model006
## 5         9 roc_auc binary     0.531    10  0.0113 Preprocessor1_Model009

Fit the model

We choose the value of k with the best cross-validated accuracy; that is, we use k = 1 to fit the final model.

best_accuracy1 <- select_best(tune_res1, metric = "accuracy")
best_accuracy1
## # A tibble: 1 x 2
##   neighbors .config               
##       <int> <chr>                 
## 1         1 Preprocessor1_Model001
knn_final1 <- finalize_workflow(knn_wf1, best_accuracy1)
knn_fit1 <- fit(knn_final1, data = BTC_train)
knn_fit1
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(1L,     data, 5))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.4734093
## Best kernel: optimal
## Best k: 1

Final Accuracy

augment(knn_fit1, new_data = BTC_test) %>%
  accuracy(truth = Direction_Today, estimate = .pred_class) %>%
  mutate(description = "KNN Model with Lag1 and Lag3 (CV)") -> acc_knn_fit1
acc_knn_fit1
## # A tibble: 1 x 4
##   .metric  .estimator .estimate description                      
##   <chr>    <chr>          <dbl> <chr>                            
## 1 accuracy binary         0.511 KNN Model with Lag1 and Lag3 (CV)

Confusion Matrix

In the one-neighbor KNN model, there are 149 true Downs against 173 false Downs, and 210 true Ups against 170 false Ups, giving an accuracy of (149 + 210)/702, or about 51.1%.

augment(knn_fit1, new_data = BTC_test) %>%
  conf_mat(truth = Direction_Today, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Bootstrapping in KNN Model

Bootstrap Sampling

We use 25 bootstrap resamples of the training set.

set.seed(1200)
boots25 <- bootstraps(BTC_train, strata = Direction_Today, times = 25)

Tune the Number of Neighbors

tune_res2 <- tune_grid(
  object = knn_wf1,
  resamples = boots25,
  grid = param_grid, control = control_grid(verbose = TRUE)
)

Visualization

With bootstrap resampling, accuracy is again highest when the number of neighbors is 1 (red line), while the ROC AUC peaks around 5 neighbors (blue line). Thus, judged by accuracy, the best hyperparameter is again 1.

autoplot(tune_res2) +
  geom_vline(xintercept = 1, color = "red") +
  geom_vline(xintercept = 5, color = "blue")

tune_res2 %>%
  show_best(metric = "accuracy")
## # A tibble: 5 x 7
##   neighbors .metric  .estimator  mean     n std_err .config               
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                 
## 1         1 accuracy binary     0.515    25 0.00294 Preprocessor1_Model001
## 2         2 accuracy binary     0.515    25 0.00294 Preprocessor1_Model002
## 3         3 accuracy binary     0.515    25 0.00294 Preprocessor1_Model003
## 4         5 accuracy binary     0.515    25 0.00332 Preprocessor1_Model005
## 5         6 accuracy binary     0.513    25 0.00324 Preprocessor1_Model006
tune_res2 %>%
  show_best(metric = "roc_auc")
## # A tibble: 5 x 7
##   neighbors .metric .estimator  mean     n std_err .config               
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
## 1         5 roc_auc binary     0.517    25 0.00346 Preprocessor1_Model005
## 2         4 roc_auc binary     0.517    25 0.00354 Preprocessor1_Model004
## 3         6 roc_auc binary     0.517    25 0.00357 Preprocessor1_Model006
## 4         3 roc_auc binary     0.517    25 0.00328 Preprocessor1_Model003
## 5         7 roc_auc binary     0.516    25 0.00350 Preprocessor1_Model007

Fit the model

We again choose the k with the best accuracy to fit the KNN model; as with cross-validation, that is k = 1.

best_accuracy2 <- select_best(tune_res2, metric = "accuracy")
best_accuracy2
## # A tibble: 1 x 2
##   neighbors .config               
##       <int> <chr>                 
## 1         1 Preprocessor1_Model001
knn_final2 <- finalize_workflow(knn_wf1, best_accuracy2)
knn_fit2 <- fit(knn_final2, data = BTC_train)
knn_fit2
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(1L,     data, 5))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.4734093
## Best kernel: optimal
## Best k: 1

Final Accuracy

augment(knn_fit2, new_data = BTC_test) %>%
  accuracy(truth = Direction_Today, estimate = .pred_class) %>%
  mutate(description = "KNN Model with Lag1 and Lag3 (Bootstrap)") -> acc_knn_fit2
acc_knn_fit2
## # A tibble: 1 x 4
##   .metric  .estimator .estimate description                             
##   <chr>    <chr>          <dbl> <chr>                                   
## 1 accuracy binary         0.511 KNN Model with Lag1 and Lag3 (Bootstrap)

Confusion Matrix

The confusion matrix is effectively identical: 149 true Downs against 173 false Downs, and 210 true Ups against 170 false Ups. This matches the cross-validated KNN model, since both procedures selected k = 1 and the final model was fit on the same training set.

augment(knn_fit2, new_data = BTC_test) %>%
  conf_mat(truth = Direction_Today, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

8. Conclusion

The logistic regression model is the best predictor of Bitcoin's daily up/down direction. With just Lag1 and Lag3 as predictors, it correctly predicts the direction in 55 percent of test cases.

com_table <- bind_rows(acc_lr_fit1, acc_lr_fit2, acc_lda_fit1,
                       acc_qda_fit1, acc_knn_fit1, acc_knn_fit2)
com_table %>%
  arrange(desc(.estimate))
## # A tibble: 6 x 4
##   .metric  .estimator .estimate description                             
##   <chr>    <chr>          <dbl> <chr>                                   
## 1 accuracy binary         0.550 LR Model with Lag1 and Lag3             
## 2 accuracy binary         0.547 LDA Model with Lag1 and Lag3            
## 3 accuracy binary         0.540 LR Model with Lag1 to Lag5              
## 4 accuracy binary         0.524 QDA Model with Lag1 and Lag3            
## 5 accuracy binary         0.511 KNN Model with Lag1 and Lag3 (CV)       
## 6 accuracy binary         0.511 KNN Model with Lag1 and Lag3 (Bootstrap)

Furthermore, the ROC curves show that the outcomes of these six models are not very different.

models <- list("Logistic Regression 1" = lr_fit1,
               "Logistic Regression 2" = lr_fit2,
               "LDA" = lda_fit1,
               "QDA" = qda_fit1,
               "KNN CV" = knn_fit1,
               "KNN Boots" = knn_fit2
               )
preds <- imap_dfr(models, augment, 
                  new_data = BTC_test, .id = "model")

preds %>%
  group_by(model) %>%
  roc_curve(Direction_Today, .pred_Down) %>%
  autoplot() +
  labs(y = "Sensitivity (true positives / all actual positives)",
       x = "1 - Specificity (false positives / all actual negatives)") 

9. References