Multiple Linear Regression on Bitcoin Price

Kris Kilgroe

2022-06-23

Disclaimer

This is not financial advice. This is a demonstration of multiple linear regression applied to Bitcoin Price.

Introduction

Multiple linear regression is a powerful technique in determining the correlation between factors influencing outputs such as sales or advertising revenue, attendance, sports performance, medical statistics, among other items.

Today, we will apply multiple linear regression to BTC Price valued in USD (the dependent variable)1 , in order to determine significant correlating factors.

Let’s consider the following independent variables:

  • BTC Mining Difficulty2
  • BTC.Mining.Profitability3
  • BTC.Atm.Count.US4
  • Federal Funds Rate as a percentage5
  • M1 Money supply in billions6
  • Global Inflation Rate7

I created a custom dataset8 compiling all of the monthly intervals for the aforementioned data, with the data range June 2017 to December 2021.

I chose these variables out of curiosity to see how a mix of Bitcoin-specific variables (mining difficulty and mining profitability), Bitcoin ATM count, along with macroeconomic variables (including the recently publicized Federal Funds Rate) may or may not be significant.

Although I chose these variables based on some common knowledge and my perspective on the Bitcoin ecosystem, out of strict objectivity I am initially agnostic to which independent variables will or will not provide any correlation to Bitcoin Price.

Process

We will conduct a classic multivariable regression analysis in sequence by

  • importing and cleaning the data
  • splitting the data into training and test (validation) sets
  • building the model
  • testing assumptions and the model’s accuracy
  • drawing sensible conclusions for the model’s efficacy

Objective: Predict BTC Price for January 2022 (one month past the data end date)

Section 1: Data Cleaning

Import libraries.

library(readxl)           #to read in excel file dataset
library(tidyverse)        #for required functions to transform data (stemming from dplyr & ggplot packages)
library(caret)            #for training and holdout sets
library(broom)            #for working with the model output
library(corrplot)         #to construct the required correlation matrix
library(car)              #to augment the correlation matrix assumption testing 
library(scatterplot3d)    #for 3D plotting
library(olsrr)            #for checking residuals for normal distribution
library(lmtest)           #for checking Homoscedasticity
library(kableExtra)       #to enhance the kable package


Read in the data.

BTC <- read_excel("data.xlsx")


Check the row count and for n/a or null values.

kbl(nrow(BTC))
is.na(BTC)
is.null(BTC)

The output shows 55 rows and no missing values.

To ensure the achievement of homoscedasticity (a required model testing assumption after we build our model), transform BTC.Price logarithmically.

BTC$BTC.Price.Log = log(BTC$BTC.Price)

#reorder columns 
BTC <- BTC %>% relocate("BTC.Price.Log", .before = "BTC.Mining.Difficulty")

We will retain BTC.Price for reference, but will not employ it in the linear regression model analysis as the independent variable. The transformed log column for BTC.Price will serve as the independent variable.

A good practice is to delete any outliers that may skew the data.  Employing the Cook's Distance technique will pinpoint and delete outliers.9

mod <- lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability 
          + BTC.Atm.Count.US + Fed.Fund.Rate + M1.Money.Supply 
          + Global.Inflation.Rate, data = BTC)
BTC$cooksd <- cooks.distance(mod)

#define outliers based on 4/n criteria
BTC$outlier <- ifelse(BTC$cooksd < 4/nrow(BTC), "keep","delete")

#inspect the dataset, be sure the range is 1:55 for all observations
BTC[1:55, ]
## # A tibble: 55 × 11
##    `Month-Year` BTC.Price BTC.Price.Log BTC.Mining.Difficulty BTC.Mining.Profit…
##    <chr>            <dbl>         <dbl>                 <dbl>              <dbl>
##  1 Jun 2017         2547.          7.84                  0.71               1.21
##  2 Jul 2017         2719.          7.91                  0.86               0.88
##  3 Aug 2017         4676.          8.45                  0.89               1.62
##  4 Sep 2017         4208.          8.34                  1.1                0.84
##  5 Oct 2017         6131.          8.72                  1.45               1.11
##  6 Nov 2017        10543.          9.26                  1.35               2.08
##  7 Dec 2017        13062.          9.48                  1.87               2.28
##  8 Jan 2018         9914.          9.20                  2.6                0.91
##  9 Feb 2018        10903.          9.30                  3.01               0.85
## 10 Mar 2018         6970.          8.85                  3.46               0.46
## # … with 45 more rows, and 6 more variables: BTC.Atm.Count.US <dbl>,
## #   Fed.Fund.Rate <dbl>, M1.Money.Supply <dbl>, Global.Inflation.Rate <dbl>,
## #   cooksd <dbl>, outlier <chr>

This is confirmed with A tibble: 55 x 11 output.

Display the results showing the outliers.

df_BTC_cooksd <-BTC[1:55, ]

df_BTC_cooksd %>% 
  kbl() %>% 
  kable_styling() %>% 
  row_spec(which(df_BTC_cooksd$outlier == "delete"), bold = T, 
           color = "white", background = "red")
Month-Year BTC.Price BTC.Price.Log BTC.Mining.Difficulty BTC.Mining.Profitability BTC.Atm.Count.US Fed.Fund.Rate M1.Money.Supply Global.Inflation.Rate cooksd outlier
Jun 2017 2547.32 7.842797 0.71 1.21 766.0 1.04 3524.9 3.23 0.1271788 delete
Jul 2017 2718.62 7.907880 0.86 0.88 826.0 1.15 3548.6 3.23 0.0497825 keep
Aug 2017 4676.14 8.450228 0.89 1.62 853.0 1.16 3587.8 3.23 0.0611313 keep
Sep 2017 4207.83 8.344702 1.10 0.84 908.0 1.15 3571.2 3.23 0.0000132 keep
Oct 2017 6130.56 8.721041 1.45 1.11 948.0 1.15 3606.3 3.23 0.0053589 keep
Nov 2017 10542.66 9.263185 1.35 2.08 1052.0 1.16 3631.0 3.23 0.0013759 keep
Dec 2017 13062.15 9.477474 1.87 2.28 1125.0 1.30 3610.6 3.23 0.0068056 keep
Jan 2018 9914.47 9.201751 2.60 0.91 1216.0 1.41 3656.2 3.59 0.0234163 keep
Feb 2018 10903.02 9.296795 3.01 0.85 1319.0 1.42 3619.0 3.59 0.0354220 keep
Mar 2018 6970.36 8.849422 3.46 0.46 1475.0 1.51 3663.4 3.59 0.0141478 keep
Apr 2018 9268.72 9.134401 4.02 0.62 1624.0 1.69 3655.2 3.59 0.0143643 keep
May 2018 7363.28 8.904261 4.31 0.44 1752.0 1.70 3650.8 3.59 0.0077092 keep
Jun 2018 6189.31 8.730579 5.08 0.31 1880.0 1.82 3652.2 3.59 0.0010366 keep
Jul 2018 8166.39 9.007782 5.95 0.34 1982.0 1.91 3677.5 3.59 0.0062702 keep
Aug 2018 6938.91 8.844900 6.73 0.28 1988.0 1.91 3693.2 3.59 0.0006102 keep
Sep 2018 6576.61 8.791275 7.15 0.25 2066.0 1.95 3701.8 3.59 0.0000162 keep
Oct 2018 6276.46 8.744561 7.18 0.23 2119.0 2.19 3728.2 3.59 0.0002029 keep
Nov 2018 3952.45 8.282091 6.65 0.15 2183.0 2.20 3705.6 3.59 0.0196632 keep
Dec 2018 3689.56 8.213263 5.62 0.17 2248.0 2.27 3763.4 3.59 0.0246806 keep
Jan 2019 3441.03 8.143526 5.81 0.17 2301.0 2.40 3754.8 3.47 0.0299279 keep
Feb 2019 3772.94 8.235610 6.07 0.16 2428.0 2.40 3762.9 3.47 0.0195361 keep
Mar 2019 4081.22 8.314151 6.38 0.17 2487.0 2.41 3733.6 3.47 0.0142992 keep
Apr 2019 5151.43 8.547030 6.35 0.20 2602.0 2.42 3775.0 3.47 0.0011489 keep
May 2019 8287.04 9.022448 7.46 0.27 2760.0 2.39 3783.2 3.47 0.0100899 keep
Jun 2019 12024.08 9.394667 7.93 0.37 2915.0 2.38 3824.4 3.47 0.0462735 keep
Jul 2019 9572.74 9.166675 9.01 0.32 3061.0 2.40 3856.8 3.47 0.0107447 keep
Aug 2019 9590.74 9.168553 10.18 0.25 3271.0 2.13 3858.9 3.47 0.0055603 keep
Sep 2019 8085.71 8.997854 12.76 0.16 3332.0 2.04 3900.0 3.47 0.0016219 keep
Oct 2019 9226.00 9.129781 13.69 0.16 3508.0 1.83 3935.3 3.47 0.0000842 keep
Nov 2019 7729.27 8.952770 12.97 0.14 3752.0 1.55 3955.3 3.47 0.0015848 keep
Dec 2019 7251.28 8.888933 12.95 0.14 3943.0 1.55 4007.1 3.47 0.0041594 keep
Jan 2020 9545.08 9.163781 15.47 0.15 4199.0 1.55 3993.6 3.18 0.0053137 keep
Feb 2020 8778.47 9.080057 15.49 0.13 4548.0 1.58 4002.9 3.18 0.0008524 keep
Mar 2020 6483.74 8.777053 13.91 0.12 4859.0 0.65 4261.9 3.18 0.0008725 keep
Apr 2020 8773.11 9.079447 15.96 0.15 5222.0 0.05 4779.8 3.18 0.0262619 keep
May 2020 9688.32 9.178676 15.14 0.08 5535.0 0.05 16232.9 3.18 0.0000022 keep
Jun 2020 9188.06 9.125660 15.78 0.07 5863.0 0.08 16563.5 3.18 0.0035458 keep
Jul 2020 11118.92 9.316403 16.85 0.11 6320.0 0.09 16763.8 3.18 0.0004197 keep
Aug 2020 11657.00 9.363662 17.56 0.09 6692.0 0.10 16896.7 3.18 0.0004485 keep
Sep 2020 10764.28 9.283988 19.31 0.08 7396.0 0.09 17161.5 3.18 0.0192599 keep
Oct 2020 13573.71 9.515890 20.00 0.13 8323.0 0.09 17365.3 3.18 0.0049156 keep
Nov 2020 18114.41 9.804463 19.16 0.12 9096.0 0.09 17626.7 3.18 0.0159129 keep
Dec 2020 28768.84 10.267048 18.60 0.22 26722.7 0.09 17834.5 3.18 0.1374381 delete
Jan 2021 34622.37 10.452255 20.82 0.24 44349.4 0.09 18107.1 4.35 0.0541379 keep
Feb 2021 46642.61 10.750270 21.72 0.27 61976.1 0.08 18367.8 4.35 0.0018537 keep
Mar 2021 58734.48 10.980782 21.87 0.39 79602.8 0.07 18641.4 4.35 0.0024732 keep
Apr 2021 53260.30 10.882946 23.58 0.36 97229.5 0.07 18927.7 4.35 0.0083401 keep
May 2021 35749.66 10.484296 21.05 0.23 114856.2 0.06 19259.0 4.35 0.0125123 keep
Jun 2021 35945.79 10.489767 19.93 0.24 132482.9 0.08 19318.8 4.35 0.0045121 keep
Jul 2021 41157.15 10.625153 14.49 0.36 150109.6 0.10 19497.1 4.35 0.0412099 keep
Aug 2021 47663.02 10.771911 17.62 0.35 167736.3 0.09 19746.0 4.35 0.0134071 keep
Sep 2021 41412.61 10.631341 18.99 0.34 185363.0 0.08 19898.4 4.35 0.0001254 keep
Oct 2021 61374.28 11.024746 21.66 0.36 202989.7 0.08 20061.8 4.35 0.0081369 keep
Nov 2021 57834.36 10.965338 22.34 0.30 220616.4 0.08 20273.9 4.35 0.0025818 keep
Dec 2021 47128.47 10.760633 24.27 0.29 238243.1 0.08 20423.6 4.35 0.0856546 delete

We find three identified outliers in the added outlier column.

Next, we will delete the outliers from the dataset.

#redundant check of outlier count
outliers <- df_BTC_cooksd == 'delete'
sum(outliers) 
## [1] 3
#remove the three outliers from the dataframe
BTC2 <- subset(BTC, outlier != "delete")

#verify that there are 52 observations remaining
nrow(BTC2)
## [1] 52
#Delete extraneous columns that resulted from the Cook's Distance formula
BTC3 <- BTC2[c(-10,-11)]


This is our complete dataset in all it’s glory.

BTC3 %>% 
  kbl() %>% 
  kable_styling()
Month-Year BTC.Price BTC.Price.Log BTC.Mining.Difficulty BTC.Mining.Profitability BTC.Atm.Count.US Fed.Fund.Rate M1.Money.Supply Global.Inflation.Rate
Jul 2017 2718.62 7.907880 0.86 0.88 826.0 1.15 3548.6 3.23
Aug 2017 4676.14 8.450228 0.89 1.62 853.0 1.16 3587.8 3.23
Sep 2017 4207.83 8.344702 1.10 0.84 908.0 1.15 3571.2 3.23
Oct 2017 6130.56 8.721041 1.45 1.11 948.0 1.15 3606.3 3.23
Nov 2017 10542.66 9.263185 1.35 2.08 1052.0 1.16 3631.0 3.23
Dec 2017 13062.15 9.477474 1.87 2.28 1125.0 1.30 3610.6 3.23
Jan 2018 9914.47 9.201751 2.60 0.91 1216.0 1.41 3656.2 3.59
Feb 2018 10903.02 9.296795 3.01 0.85 1319.0 1.42 3619.0 3.59
Mar 2018 6970.36 8.849422 3.46 0.46 1475.0 1.51 3663.4 3.59
Apr 2018 9268.72 9.134401 4.02 0.62 1624.0 1.69 3655.2 3.59
May 2018 7363.28 8.904261 4.31 0.44 1752.0 1.70 3650.8 3.59
Jun 2018 6189.31 8.730579 5.08 0.31 1880.0 1.82 3652.2 3.59
Jul 2018 8166.39 9.007782 5.95 0.34 1982.0 1.91 3677.5 3.59
Aug 2018 6938.91 8.844900 6.73 0.28 1988.0 1.91 3693.2 3.59
Sep 2018 6576.61 8.791275 7.15 0.25 2066.0 1.95 3701.8 3.59
Oct 2018 6276.46 8.744561 7.18 0.23 2119.0 2.19 3728.2 3.59
Nov 2018 3952.45 8.282091 6.65 0.15 2183.0 2.20 3705.6 3.59
Dec 2018 3689.56 8.213263 5.62 0.17 2248.0 2.27 3763.4 3.59
Jan 2019 3441.03 8.143526 5.81 0.17 2301.0 2.40 3754.8 3.47
Feb 2019 3772.94 8.235610 6.07 0.16 2428.0 2.40 3762.9 3.47
Mar 2019 4081.22 8.314151 6.38 0.17 2487.0 2.41 3733.6 3.47
Apr 2019 5151.43 8.547030 6.35 0.20 2602.0 2.42 3775.0 3.47
May 2019 8287.04 9.022448 7.46 0.27 2760.0 2.39 3783.2 3.47
Jun 2019 12024.08 9.394667 7.93 0.37 2915.0 2.38 3824.4 3.47
Jul 2019 9572.74 9.166675 9.01 0.32 3061.0 2.40 3856.8 3.47
Aug 2019 9590.74 9.168553 10.18 0.25 3271.0 2.13 3858.9 3.47
Sep 2019 8085.71 8.997854 12.76 0.16 3332.0 2.04 3900.0 3.47
Oct 2019 9226.00 9.129781 13.69 0.16 3508.0 1.83 3935.3 3.47
Nov 2019 7729.27 8.952770 12.97 0.14 3752.0 1.55 3955.3 3.47
Dec 2019 7251.28 8.888933 12.95 0.14 3943.0 1.55 4007.1 3.47
Jan 2020 9545.08 9.163781 15.47 0.15 4199.0 1.55 3993.6 3.18
Feb 2020 8778.47 9.080057 15.49 0.13 4548.0 1.58 4002.9 3.18
Mar 2020 6483.74 8.777053 13.91 0.12 4859.0 0.65 4261.9 3.18
Apr 2020 8773.11 9.079447 15.96 0.15 5222.0 0.05 4779.8 3.18
May 2020 9688.32 9.178676 15.14 0.08 5535.0 0.05 16232.9 3.18
Jun 2020 9188.06 9.125660 15.78 0.07 5863.0 0.08 16563.5 3.18
Jul 2020 11118.92 9.316403 16.85 0.11 6320.0 0.09 16763.8 3.18
Aug 2020 11657.00 9.363662 17.56 0.09 6692.0 0.10 16896.7 3.18
Sep 2020 10764.28 9.283988 19.31 0.08 7396.0 0.09 17161.5 3.18
Oct 2020 13573.71 9.515890 20.00 0.13 8323.0 0.09 17365.3 3.18
Nov 2020 18114.41 9.804463 19.16 0.12 9096.0 0.09 17626.7 3.18
Jan 2021 34622.37 10.452255 20.82 0.24 44349.4 0.09 18107.1 4.35
Feb 2021 46642.61 10.750270 21.72 0.27 61976.1 0.08 18367.8 4.35
Mar 2021 58734.48 10.980782 21.87 0.39 79602.8 0.07 18641.4 4.35
Apr 2021 53260.30 10.882946 23.58 0.36 97229.5 0.07 18927.7 4.35
May 2021 35749.66 10.484296 21.05 0.23 114856.2 0.06 19259.0 4.35
Jun 2021 35945.79 10.489767 19.93 0.24 132482.9 0.08 19318.8 4.35
Jul 2021 41157.15 10.625153 14.49 0.36 150109.6 0.10 19497.1 4.35
Aug 2021 47663.02 10.771911 17.62 0.35 167736.3 0.09 19746.0 4.35
Sep 2021 41412.61 10.631341 18.99 0.34 185363.0 0.08 19898.4 4.35
Oct 2021 61374.28 11.024746 21.66 0.36 202989.7 0.08 20061.8 4.35
Nov 2021 57834.36 10.965338 22.34 0.30 220616.4 0.08 20273.9 4.35

We are now ready to to make our Training and Test sets.

Section 2: Training and Test Sets

Creating training and holdout sets is mandatory to ensure the model is fits the data well.
We use the training set to develop the model, then we will apply the model to the test set in order to determine the model’s efficacy.10

Create the training and test sets.

#make the procedure reproducible
set.seed(1)

#use 70% of dataset as training set and 30% as test set
sample <- sample(c(TRUE, FALSE), nrow(BTC3), replace=TRUE, prob=c(0.7,0.3))
train  <- BTC3[sample, ]
test   <- BTC3[!sample, ]

We will return to the test dataset after we build our model with the training data first.

We confirm that the spread is 70% allocated to the training set and 30% allocated to the test set.

nrow(train)
## [1] 35
nrow(test)
## [1] 17
#Training set
35/52
## [1] 0.6730769
#Test Set
17/52
## [1] 0.3269231

Close enough. Now, let’s test for the data’s integrity.

Section 3: Multicollinearity & Variance Inflation Factor

Multicollinearity is the phenomena where multiple independent variables are correlated, or influence each other.11 This is a problem because predictor variables should have no correlation or as little correlation as possible. We will test for this.

#subset the data only for the `independent variables`
train2 <- train[c(-1,-2, -3)]
par(mfrow=c(1,1))

#create a numeric vector
datamatrix <- cor(train2)

#make the correlation matrix
corrplot(datamatrix, method ='number')


On a cursory level, BTC.Mining.Difficulty, M1.Money.Supply, BTC.Atm.Count.US and Global.Inflation.Rate seem to have moderate to high correlation between other variables.  Should we delete any of these predictor variables at this point?

To determine which variables to scrap, if needed, we employ Variance Inflation Factor (VIF) for augmented analysis.12

#be sure to revert to the original training set to reincorporate the independent variable into the model 
VIF<-vif(lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + BTC.Atm.Count.US 
            + Fed.Fund.Rate + M1.Money.Supply + Global.Inflation.Rate, data = train))
VIF %>% 
  kbl() %>% 
  kable_styling()
x
BTC.Mining.Difficulty 5.160656
BTC.Mining.Profitability 2.311268
BTC.Atm.Count.US 4.805880
Fed.Fund.Rate 4.974582
M1.Money.Supply 8.509970
Global.Inflation.Rate 3.940872

The VIF should be less than 10. A VIF greater than 10 indicates severe multicollinearity. Since all variables are under 10 (for some barely), we will retain all variables at this moment.13

On to the good stuff, let’s build the linear regression model!

Section 4: Modeling

Calculate the linear regression model.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + BTC.Atm.Count.US 
          + Fed.Fund.Rate + M1.Money.Supply + Global.Inflation.Rate, data = train)
summary(model)
## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + 
##     BTC.Atm.Count.US + Fed.Fund.Rate + M1.Money.Supply + Global.Inflation.Rate, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54677 -0.14387  0.03012  0.13608  0.42740 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.714e+00  6.619e-01   7.122 9.49e-08 ***
## BTC.Mining.Difficulty    7.520e-02  1.358e-02   5.538 6.38e-06 ***
## BTC.Mining.Profitability 5.905e-01  1.512e-01   3.906  0.00054 ***
## BTC.Atm.Count.US         6.591e-07  1.634e-06   0.403  0.68974    
## Fed.Fund.Rate            1.090e-01  1.052e-01   1.037  0.30864    
## M1.Money.Supply          1.757e-05  1.795e-05   0.979  0.33619    
## Global.Inflation.Rate    8.961e-01  1.993e-01   4.496  0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2432 on 28 degrees of freedom
## Multiple R-squared:  0.9256, Adjusted R-squared:  0.9096 
## F-statistic: 58.02 on 6 and 28 DF,  p-value: 1.667e-14

Since the BTC.Atm.Count.US is the greatest non-significant variable (with a p-value > 0.05), we will delete this variable.

With the deleted predictor variable, we run the process again.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + Fed.Fund.Rate 
          + M1.Money.Supply + Global.Inflation.Rate, data = train)
summary(model)
## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + 
##     Fed.Fund.Rate + M1.Money.Supply + Global.Inflation.Rate, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5430 -0.1556  0.0394  0.1362  0.4288 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.531e+00  4.745e-01   9.549 1.86e-10 ***
## BTC.Mining.Difficulty    7.566e-02  1.333e-02   5.674 3.91e-06 ***
## BTC.Mining.Profitability 5.987e-01  1.476e-01   4.056 0.000344 ***
## Fed.Fund.Rate            1.057e-01  1.033e-01   1.023 0.314770    
## M1.Money.Supply          1.905e-05  1.732e-05   1.100 0.280297    
## Global.Inflation.Rate    9.478e-01  1.504e-01   6.303 6.97e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2396 on 29 degrees of freedom
## Multiple R-squared:  0.9251, Adjusted R-squared:  0.9122 
## F-statistic: 71.66 on 5 and 29 DF,  p-value: 2.007e-15

Since the Federal.Fund.Rate is the next largest non-significant variable (with a p-value > 0.05), we will delete this variable.

With the deleted predictor variable, we run the process again.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + M1.Money.Supply 
          + Global.Inflation.Rate, data = train)
summary(model)
## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + 
##     M1.Money.Supply + Global.Inflation.Rate, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51257 -0.13521  0.02273  0.14103  0.49358 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.598e+00  4.703e-01   9.777 7.71e-11 ***
## BTC.Mining.Difficulty    7.148e-02  1.270e-02   5.627 3.97e-06 ***
## BTC.Mining.Profitability 5.288e-01  1.310e-01   4.038 0.000344 ***
## M1.Money.Supply          7.292e-06  1.296e-05   0.563 0.577871    
## Global.Inflation.Rate    1.012e+00  1.367e-01   7.406 2.98e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2398 on 30 degrees of freedom
## Multiple R-squared:  0.9224, Adjusted R-squared:  0.9121 
## F-statistic: 89.18 on 4 and 30 DF,  p-value: 3.29e-16

Since the M1.Money.Supply is the next least significant variable (with a p-value > 0.05), we will delete this variable.

With the deleted predictor variable, we run the process again.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability 
          + Global.Inflation.Rate, data = train)
summary(model)
## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + 
##     Global.Inflation.Rate, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4913 -0.1487  0.0283  0.1471  0.4853 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.443317   0.376956  11.787 5.50e-13 ***
## BTC.Mining.Difficulty    0.076886   0.008222   9.351 1.55e-10 ***
## BTC.Mining.Profitability 0.557522   0.119266   4.675 5.45e-05 ***
## Global.Inflation.Rate    1.051885   0.115775   9.086 3.01e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2372 on 31 degrees of freedom
## Multiple R-squared:  0.9216, Adjusted R-squared:  0.914 
## F-statistic: 121.5 on 3 and 31 DF,  p-value: < 2.2e-16


Executive Decision:  I am not convinced that the Global.Inflation.Rate variable is a good metric in this particular instance. The reason being: I was not able to capture a month-to-month assessment in keeping with the data; instead, I scraped a blanket average for each year, given the resources at hand.14 

Global Inflation Rate may fluctuate month-to-month. While it may have an attractive p-value for this model, the variable is not trustworthy due to my lack of capturing better granualarity, inconsistent with the other remaining predictor variables.

Thus, due to how I perceive the Global Inflation Rate interacts with the data, I am going to delete this variable, leaving only BTC.Mining.Difficulty and BTC.Mining.Profitability as the star predictor variables.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability, data = train)
summary(model)
## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62588 -0.42858  0.09613  0.35924  0.96722 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               7.69717    0.22158  34.738  < 2e-16 ***
## BTC.Mining.Difficulty     0.11588    0.01321   8.771 5.06e-10 ***
## BTC.Mining.Profitability  0.78244    0.21977   3.560  0.00118 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4468 on 32 degrees of freedom
## Multiple R-squared:  0.7129, Adjusted R-squared:  0.6949 
## F-statistic: 39.72 on 2 and 32 DF,  p-value: 2.136e-09
print(model)
## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability, 
##     data = train)
## 
## Coefficients:
##              (Intercept)     BTC.Mining.Difficulty  BTC.Mining.Profitability  
##                   7.6972                    0.1159                    0.7824

Thus, the significant independent variables that we will retain are: BTC.Mining.Difficulty and BTC.Mining.Profitability

\({R^2}\) is 0.7129, which means that the model created explains 71.29% of the variability in the given test data set.

Our model follows the form \(Y_{0}\) = \(\beta_{0}\) + \(\beta_{1}\)\(X_{i}\) + \(\beta_{2}\)\(X_{i}\) + \(\varepsilon_{i}\)

Which transforms into:

BTC.Price.Log = 7.69 + 0.12\(X_{BTC.Mining.Difficulty}\) + 0.78\(X_{BTC.Mining.Profitability}\)

7.69 is the equation constant intercept.

A 0.12 increase in BTC.Price.Log is associated with a BTC.Mining.Difficulty increase by a factor of 1.

A 0.78 increase in BTC.Price.Log is associated with a BTC.Mining.Profitability increase by a factor of 1.

Section 5: Testing Assumptions

Check for homoscedasticity

#visual look
par(mfrow=c(1,1))
plot(model)


At a cursory glance, although there is not very much data provided in the training set (35 observations), the data seems to meet the assumption, with a seemingly constant scatter through the residual output.

However, just to be sure, verify with the Breusch-Pagan test

lmtest::bptest(model)
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 1.4359, df = 2, p-value = 0.4878

Since the null hypothesis of the Breusch-Pagan test is that heteroscedastiy is present, and the p value > 0.05 (not significant), we reject the null hypothesis and determine that the model’s residuals are homoscedastic.

Check for normal distribution

ols_plot_resid_hist(model)

Most of the residuals fall around zero, and the number of observations in the tails of the histogram is low, therefore the regression model is fairly normally distributed. For OCD types wanting a picture perfect bell-shaped curve spread, calm down, looks can be deceiving.

Let’s implement a Moneyball approach with some hard math.

ols_test_normality(model)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9379         0.0482 
## Kolmogorov-Smirnov         0.12          0.6507 
## Cramer-von Mises          4.0129         0.0000 
## Anderson-Darling           0.77          0.0411 
## -----------------------------------------------

Since the number of observations in our test dataset are less than 50, we use Shapiro-Wilk test. Shapiro-Wilk shows that the p-value is less than 0.05, meaning we reject the null hypothesis that the residuals are not normally distributed, and rather that the residuals are indeed normally distributed. This model is normally distributed (by a hair), but that’s good for our purposes. It gets on base.

Assess multicollinearity again.

remaining_variables <- train[c(4,5)]
par(mfrow=c(1,1))
datamatrix <- cor(remaining_variables)
corrplot(datamatrix, method ='number')


We confirm no outstanding multicollinearity with the remaining dependent variables in the model.

Additionally, they are moderately negatively correlated. A decrease in mining difficulty indicates an increase in mining profitability, and an increase in mining difficulty indicates a decrease in mining profitability.

Independence of Errors: the residuals indicate that the noise terms are independent (possessing no autocorrelation, and confirmed with the multicollinearity test). Furthermore, the noise terms are independent of the explanatory variables and there is no obvious string of positive to negative residuals, or any other definable pattern.

However, we are not finished yet.

“I gotta lotta mo!”
— Mr. T, Rocky III

Section 6: Testing Root Means Squared Predicted Error (RMSPE)

RMSPE determines how well a model fits the hold-out data. This is where we will apply the training set model on the test dataset, to validate where or not it fits the data.15

Use thepredict function with the training set-developed model.

p <- predict(model, newdata = test)
error <- test$BTC.Price.Log - p
RMSPE <- sqrt(mean(error^2))
RMSPE
## [1] 0.3822073


Recall that the standard error of the model is ~0.44 (reference Residual standard error in output).

summary(model)
## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62588 -0.42858  0.09613  0.35924  0.96722 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               7.69717    0.22158  34.738  < 2e-16 ***
## BTC.Mining.Difficulty     0.11588    0.01321   8.771 5.06e-10 ***
## BTC.Mining.Profitability  0.78244    0.21977   3.560  0.00118 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4468 on 32 degrees of freedom
## Multiple R-squared:  0.7129, Adjusted R-squared:  0.6949 
## F-statistic: 39.72 on 2 and 32 DF,  p-value: 2.136e-09


Since RMSPE of 0.38 is smaller than the Residual Standard error of 0.44, we determine that the training set model predicts the holdout set better than the training (validation) set, which is great news.

We’re good to hook and make a prediction on January 2022's Bitcoin Price.

Section 7: Predicting Bitcoin Price in January 2022

From Section 4: Modeling, our stellar model equation is:

BTC.Price.Log = 7.69 + 0.12\(X_{BTC.Mining.Difficulty}\) + 0.78\(X_{BTC.Mining.Profitability}\)

In January 2022, the BTC.Mining.Difficulty was ~25.17 and the BTC.Mining.Profitability was ~0.19.

Calculate the price prediction, remembering to take the inverse log function of the result to convert it to Bitcoin in USD, as we transformed the original data logarithmically in Section 1: Data Cleaning.

c <- 7.69 + (.12*25.17)+(0.78*0.19)
exp(c) #take inverse log function, since we transformed the original data logarithmically during the data cleaning phase 
## [1] 51979.26


Calculate the 95% confidence interval.16

data(train)
sample.mean <- mean(train$BTC.Price.Log)
sample.n <- length(train$BTC.Price.Log)
sample.sd <- sd(train$BTC.Price.Log)
sample.se <- sample.sd/sqrt(sample.n)
alpha = 0.05
degrees.freedom = sample.n - 1
t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
margin.error <- t.score * sample.se
lower.bound <- sample.mean - margin.error
upper.bound <- sample.mean + margin.error
print(c(lower.bound,upper.bound))
## [1] 9.011596 9.567301


Take the reverse log to re-transform the interval into USD.

exp(9.011596)
## [1] 8197.594
exp(9.567301)
## [1] 14289.8

The model predicts Bitcoin price at $51,979.26 in January 2022, with a 95% confidence interval of [$8,197.59, $14,289.80]. Again, \({R^2}\) is 0.7129 and Standard Error is 0.4468.

Conclusion

The actual price of Bitcoin in January 2022 was between $37,188.11 to $38,483.13.17 Intuitively, given the volatility of Bitcoin in the last year, a difference between the model’s prediction and the actual price, of around $13,496.13 to $14,791.15. Additionally, the 95% confidence interval complements this range. The model seems to assume a sensible range for extrapolating future near-term Bitcoin price.

However, as the model is based on 35 observations, it is ill-equipped for any serious forecasting. See the Bitcoin Price Forecasting (Time-Series) model in the footnotes.18  Given the data, the analysis backing this model is stronger at identifying which potential explanatory variables are statistically significant, along with those that are frivolous. Yet, for its purpose, this model has offered some initiial insight on the actual hard factors that correlate to its price.

Deleted variables

These variables were statistically insignificant in correlating to Bitcoin’s price: BTC.ATM.Count.US, M1 Money Supply, and Federal Funds Rate. They all seemed to be good economic barometers, or similar, that might influence Bitcoin value. However, in context, Bitcoin operates in its own cryptographic ecosystem, as clearly shown by the Mining Difficulty and Mining Profitability variables making the final cut.

BTC ATMs in the US
The increase of BTC ATM supply nationwide does not necessarily assume that retail consumers or any other parties are using them to buy more Bitcoin, or influence its valuation in any way. It might be interesting to conduct a sample survey in most metropolitan areas and suburbs as to how often BTC ATMs (or BTMs) are actually used.

M1 Money Supply

One plausible explanation for this variable’s insignificance is due to the fact that fiat is not the only method for procuring Bitcoin. Bitcoin can be mined and swapped back and forth from other currencies (albeit through expensive marketplaces). Additionally, the size of M1 does not imply at any time that a significant portion of the supply is used to purchase Bitcoin.

Federal Funds Rate

The Federal Funds Rate is often hyped by media as a potential predictor on Bitcoin bear and bull markets, as of late. Yet, it does not appear to be in any way significant for this particular model. Future case studies involving a more granular capture of the Federal Funds Rate and its impact on Bitcoin might offer more insight. However, as the Federal Funds Rate to Bitcoin Price relation has not been clearly identified (statistically) today, one may assume that the Federal Funds Rate has little-to-no impact on Bitcoin’s price fluctuation in this model.

Global Inflation Rate

It would be prudent to revisit this variable. The Global Inflation Rate appeared to have significance in correlation to Bitcoin’s price; however, I did not trust the source data as it did not align with the other independent variables, and for that reason removed it preemptively.

A look at the remaining predictor variables

scatterplot3d(BTC3$BTC.Mining.Profitability, 
              BTC3$BTC.Mining.Difficulty, 
              BTC3$BTC.Price, 
              pch = 19, color = "black",
              xlab="Mining Profitability", ylab="Mining Difficulty", 
              zlab="Bitcoin Price",
              highlight.3d = TRUE, angle=45,
              trees, type="h",
              main="BTC Price against Mining Profitabilty and Mining Difficulty", 
              scale.y=0.7)


This figure isolates all 52 observation variables. The dataset clearly demonstrates the effect that Bitcoin Mining Difficulty has on Bitcoin Price. A higher Bitcoin Difficulty (difficulty in mining a Bitcoin block, or for dedicated hardware to solve a mathematical or algorithmic hash below a given target level, considering the amount of other miners attempting the same task), correlates with a higher Bitcoin Price. There may be a level of cognitive dissonance associated with recognizing that Mining Profitability appears to take a secondary placement (visually) in determining Bitcoin Price. Within the confines of this analysis, remember that the final multicollinearity shows Mining Profitability and Mining Difficulty being inversely correlated. Contextually, the more miners employed in mining a Bitcoin block, the less profitable it is for one single miner (or cluster of miners) to earn sizable revenue (hence, the popularity of mining pools).

Adjusting for Seasonality

Keep in mind that this analysis started with 55 observations and the model was based on 35, a very small amount of data. The adage of most data scientists and analysts is to get more (relevant) data. A greater quantity of observations, contingent on the accommodating date ranges of the other explanatory variables, would allow room for lagging the economic variables to adjust for inertia and seasonality. This would reinforce the model, and perhaps expose different patterns and behavior, in future iterations.

Ultimately, while domestic and global external macroeconomic factors influencing retail and institutional investor habits may affect Bitcoin price secondarily, the intrinsic factors identified only seem to have the most statistical signifiance, in determining Bitcoin valuation.

For any questions or comments on this analysis please contact .

Session Info

sessionInfo() 
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 20.3
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4     lmtest_0.9-40        zoo_1.8-10          
##  [4] olsrr_0.5.3          scatterplot3d_0.3-41 car_3.1-0           
##  [7] carData_3.0-5        corrplot_0.92        broom_0.8.0         
## [10] caret_6.0-92         lattice_0.20-45      forcats_0.5.1       
## [13] stringr_1.4.0        dplyr_1.0.9          purrr_0.3.4         
## [16] readr_2.1.2          tidyr_1.2.0          tibble_3.1.7        
## [19] ggplot2_3.3.6        tidyverse_1.3.1      readxl_1.4.0        
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-3     ellipsis_0.3.2       class_7.3-20        
##  [4] fs_1.5.2             rstudioapi_0.13      farver_2.1.0        
##  [7] listenv_0.8.0        prodlim_2019.11.13   fansi_1.0.3         
## [10] lubridate_1.8.0      xml2_1.3.3           codetools_0.2-18    
## [13] splines_4.2.1        knitr_1.39           jsonlite_1.8.0      
## [16] pROC_1.18.0          dbplyr_2.2.0         compiler_4.2.1      
## [19] httr_1.4.3           backports_1.4.1      assertthat_0.2.1    
## [22] Matrix_1.4-1         fastmap_1.1.0        tufte_0.12          
## [25] cli_3.3.0            htmltools_0.5.2      tools_4.2.1         
## [28] gtable_0.3.0         glue_1.6.2           reshape2_1.4.4      
## [31] Rcpp_1.0.8.3         cellranger_1.1.0     jquerylib_0.1.4     
## [34] vctrs_0.4.1          svglite_2.1.0        nlme_3.1-157        
## [37] iterators_1.0.14     timeDate_3043.102    gower_1.0.0         
## [40] xfun_0.31            globals_0.15.0       rvest_1.0.2         
## [43] lifecycle_1.0.1      goftest_1.2-3        future_1.26.1       
## [46] MASS_7.3-57          scales_1.2.0         ipred_0.9-13        
## [49] hms_1.1.1            parallel_4.2.1       yaml_2.3.5          
## [52] gridExtra_2.3        sass_0.4.1           rpart_4.1.16        
## [55] stringi_1.7.6        highr_0.9            foreach_1.5.2       
## [58] nortest_1.0-4        hardhat_1.1.0        lava_1.6.10         
## [61] rlang_1.0.2          pkgconfig_2.0.3      systemfonts_1.0.4   
## [64] evaluate_0.15        labeling_0.4.2       recipes_0.2.0       
## [67] tidyselect_1.1.2     parallelly_1.32.0    plyr_1.8.7          
## [70] magrittr_2.0.3       bookdown_0.27        R6_2.5.1            
## [73] generics_0.1.2       DBI_1.1.3            pillar_1.7.0        
## [76] haven_2.5.0          withr_2.5.0          survival_3.2-13     
## [79] abind_1.4-5          nnet_7.3-17          future.apply_1.9.0  
## [82] modelr_0.1.8         crayon_1.5.1         utf8_1.2.2          
## [85] tzdb_0.3.0           rmarkdown_2.14       grid_4.2.1          
## [88] data.table_1.14.2    rmdformats_1.0.4     ModelMetrics_1.2.2.2
## [91] webshot_0.5.3        reprex_2.0.1         digest_0.6.29       
## [94] stats4_4.2.1         munsell_0.5.0        viridisLite_0.4.0   
## [97] bslib_0.3.1

Footnotes


  1. Link to BTC Price dataset: https://docs.google.com/spreadsheets/d/10GtR7HMJutJqx54iJ4SMSoq1cPnl8pTv/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true↩︎

  2. Link to BTC Mining Difficulty dataset: https://docs.google.com/spreadsheets/d/1yuuzSyCeDlxePXh10_fVgtpxVC_6sChe/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true↩︎

  3. Link to BTC.Mining.Profitability dataset: https://docs.google.com/spreadsheets/d/1idsNZTugrcouyY-HGgITjXmjgIuOt26O/edit?usp=sharing&rtpof=true&sd=true↩︎

  4. Link to BTC ATM Count US dataset: https://docs.google.com/spreadsheets/d/1XEdA-5PBidwvEAQbSvuCm9-v47QHfU-U/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true↩︎

  5. Link to Federal Funds Rate dataset: https://docs.google.com/spreadsheets/d/13rb6HKnxUBTr1xInNGG-dKbTgMSaWDvL/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true↩︎

  6. Link to M1 Money supply dataset: https://docs.google.com/spreadsheets/d/1ofoa4V2hxwjJLhcRyR8J_mnfAcB1DIZ8/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true↩︎

  7. Link to Global Inflation Rate: https://docs.google.com/spreadsheets/d/1nMocnrsLW6IffI528BJYKy11mjfXKpgm/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true↩︎

  8. Link to custom dataset: https://docs.google.com/spreadsheets/d/1iIK7zBqW9gqbwSmxVG5J9P1kIiw3vam-/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true↩︎

  9. Link to Cook's Distance explanation: https://towardsdatascience.com/identifying-outliers-in-linear-regression-cooks-distance-9e212e9136a↩︎

  10. Link to Test and Validation Datasets: https://machinelearningmastery.com/difference-test-validation-datasets/↩︎

  11. Link to Multicollinearity: https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/multicollinearity.html↩︎

  12. VIF defined here: https://online.stat.psu.edu/stat462/node/180/↩︎

  13. For an argument on what determined a “good” VIF level: https://www.researchgate.net/post/Multicollinearity_issues_is_a_value_less_than_10_acceptable_for_VIF↩︎

  14. Link to Global Inflation Rate: https://docs.google.com/spreadsheets/d/1nMocnrsLW6IffI528BJYKy11mjfXKpgm/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true↩︎

  15. Explanation on RMSPE at: https://www.sciencedirect.com/topics/engineering/root-mean-squared-error↩︎

  16. Calculating Confidence Intervals in R: https://bookdown.org/logan_kelly/r_practice/p09.html↩︎

  17. Bitcoin Price for January 2022: https://infomediang.com/`Bitcoin`-price-for-january-2022/↩︎

  18. See Bitcoin Price Forecasting (Time-Series) using the Prophet package at: https://rpubs.com/KrisKilgroe/BTC_price_forecasting↩︎