Multiple Linear Regression on Bitcoin Price

Kris Kilgroe

2022-06-23

Disclaimer

This is not financial advice. This is a demonstration of multiple linear regression applied to Bitcoin Price.

Introduction

Multiple linear regression is a powerful technique in determining the correlation between factors influencing outputs such as sales or advertising revenue, attendance, sports performance, medical statistics, among other items.

Today, we will apply multiple linear regression to BTC Price valued in USD (the dependent variable)¹ , in order to determine significant correlating factors.

Let’s consider the following independent variables:

BTC Mining Difficulty²
BTC.Mining.Profitability³
BTC.Atm.Count.US⁴
Federal Funds Rate as a percentage⁵
M1 Money supply in billions⁶
Global Inflation Rate⁷

I created a custom dataset⁸ compiling all of the monthly intervals for the aforementioned data, with the data range June 2017 to December 2021.

I chose these variables out of curiosity to see how a mix of Bitcoin-specific variables (mining difficulty and mining profitability), Bitcoin ATM count, along with macroeconomic variables (including the recently publicized Federal Funds Rate) may or may not be significant.

Although I chose these variables based on some common knowledge and my perspective on the Bitcoin ecosystem, out of strict objectivity I am initially agnostic to which independent variables will or will not provide any correlation to Bitcoin Price.

Process

We will conduct a classic multivariable regression analysis in sequence by

importing and cleaning the data
splitting the data into training and test (validation) sets
building the model
testing assumptions and the model’s accuracy
drawing sensible conclusions for the model’s efficacy

Objective: Predict BTC Price for January 2022 (one month past the data end date)

Section 1: Data Cleaning

Import libraries.

library(readxl)           #to read in excel file dataset
library(tidyverse)        #for required functions to transform data (stemming from dplyr & ggplot packages)
library(caret)            #for training and holdout sets
library(broom)            #for working with the model output
library(corrplot)         #to construct the required correlation matrix
library(car)              #to augment the correlation matrix assumption testing 
library(scatterplot3d)    #for 3D plotting
library(olsrr)            #for checking residuals for normal distribution
library(lmtest)           #for checking Homoscedasticity
library(kableExtra)       #to enhance the kable package

Read in the data.

BTC <- read_excel("data.xlsx")

Check the row count and for n/a or null values.

kbl(nrow(BTC))
is.na(BTC)
is.null(BTC)

The output shows 55 rows and no missing values.

To ensure the achievement of homoscedasticity (a required model testing assumption after we build our model), transform BTC.Price logarithmically.

BTC$BTC.Price.Log = log(BTC$BTC.Price)

#reorder columns 
BTC <- BTC %>% relocate("BTC.Price.Log", .before = "BTC.Mining.Difficulty")

We will retain BTC.Price for reference, but will not employ it in the linear regression model analysis as the independent variable. The transformed log column for BTC.Price will serve as the independent variable.

A good practice is to delete any outliers that may skew the data. Employing the Cook's Distance technique will pinpoint and delete outliers.⁹

mod <- lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability 
          + BTC.Atm.Count.US + Fed.Fund.Rate + M1.Money.Supply 
          + Global.Inflation.Rate, data = BTC)
BTC$cooksd <- cooks.distance(mod)

#define outliers based on 4/n criteria
BTC$outlier <- ifelse(BTC$cooksd < 4/nrow(BTC), "keep","delete")

#inspect the dataset, be sure the range is 1:55 for all observations
BTC[1:55, ]

## # A tibble: 55 × 11
##    `Month-Year` BTC.Price BTC.Price.Log BTC.Mining.Difficulty BTC.Mining.Profit…
##    <chr>            <dbl>         <dbl>                 <dbl>              <dbl>
##  1 Jun 2017         2547.          7.84                  0.71               1.21
##  2 Jul 2017         2719.          7.91                  0.86               0.88
##  3 Aug 2017         4676.          8.45                  0.89               1.62
##  4 Sep 2017         4208.          8.34                  1.1                0.84
##  5 Oct 2017         6131.          8.72                  1.45               1.11
##  6 Nov 2017        10543.          9.26                  1.35               2.08
##  7 Dec 2017        13062.          9.48                  1.87               2.28
##  8 Jan 2018         9914.          9.20                  2.6                0.91
##  9 Feb 2018        10903.          9.30                  3.01               0.85
## 10 Mar 2018         6970.          8.85                  3.46               0.46
## # … with 45 more rows, and 6 more variables: BTC.Atm.Count.US <dbl>,
## #   Fed.Fund.Rate <dbl>, M1.Money.Supply <dbl>, Global.Inflation.Rate <dbl>,
## #   cooksd <dbl>, outlier <chr>

This is confirmed with A tibble: 55 x 11 output.

Display the results showing the outliers.

df_BTC_cooksd <-BTC[1:55, ]

df_BTC_cooksd %>% 
  kbl() %>% 
  kable_styling() %>% 
  row_spec(which(df_BTC_cooksd$outlier == "delete"), bold = T, 
           color = "white", background = "red")

Month-Year	BTC.Price	BTC.Price.Log	BTC.Mining.Difficulty	BTC.Mining.Profitability	BTC.Atm.Count.US	Fed.Fund.Rate	M1.Money.Supply	Global.Inflation.Rate	cooksd	outlier
Jun 2017	2547.32	7.842797	0.71	1.21	766.0	1.04	3524.9	3.23	0.1271788	delete
Jul 2017	2718.62	7.907880	0.86	0.88	826.0	1.15	3548.6	3.23	0.0497825	keep
Aug 2017	4676.14	8.450228	0.89	1.62	853.0	1.16	3587.8	3.23	0.0611313	keep
Sep 2017	4207.83	8.344702	1.10	0.84	908.0	1.15	3571.2	3.23	0.0000132	keep
Oct 2017	6130.56	8.721041	1.45	1.11	948.0	1.15	3606.3	3.23	0.0053589	keep
Nov 2017	10542.66	9.263185	1.35	2.08	1052.0	1.16	3631.0	3.23	0.0013759	keep
Dec 2017	13062.15	9.477474	1.87	2.28	1125.0	1.30	3610.6	3.23	0.0068056	keep
Jan 2018	9914.47	9.201751	2.60	0.91	1216.0	1.41	3656.2	3.59	0.0234163	keep
Feb 2018	10903.02	9.296795	3.01	0.85	1319.0	1.42	3619.0	3.59	0.0354220	keep
Mar 2018	6970.36	8.849422	3.46	0.46	1475.0	1.51	3663.4	3.59	0.0141478	keep
Apr 2018	9268.72	9.134401	4.02	0.62	1624.0	1.69	3655.2	3.59	0.0143643	keep
May 2018	7363.28	8.904261	4.31	0.44	1752.0	1.70	3650.8	3.59	0.0077092	keep
Jun 2018	6189.31	8.730579	5.08	0.31	1880.0	1.82	3652.2	3.59	0.0010366	keep
Jul 2018	8166.39	9.007782	5.95	0.34	1982.0	1.91	3677.5	3.59	0.0062702	keep
Aug 2018	6938.91	8.844900	6.73	0.28	1988.0	1.91	3693.2	3.59	0.0006102	keep
Sep 2018	6576.61	8.791275	7.15	0.25	2066.0	1.95	3701.8	3.59	0.0000162	keep
Oct 2018	6276.46	8.744561	7.18	0.23	2119.0	2.19	3728.2	3.59	0.0002029	keep
Nov 2018	3952.45	8.282091	6.65	0.15	2183.0	2.20	3705.6	3.59	0.0196632	keep
Dec 2018	3689.56	8.213263	5.62	0.17	2248.0	2.27	3763.4	3.59	0.0246806	keep
Jan 2019	3441.03	8.143526	5.81	0.17	2301.0	2.40	3754.8	3.47	0.0299279	keep
Feb 2019	3772.94	8.235610	6.07	0.16	2428.0	2.40	3762.9	3.47	0.0195361	keep
Mar 2019	4081.22	8.314151	6.38	0.17	2487.0	2.41	3733.6	3.47	0.0142992	keep
Apr 2019	5151.43	8.547030	6.35	0.20	2602.0	2.42	3775.0	3.47	0.0011489	keep
May 2019	8287.04	9.022448	7.46	0.27	2760.0	2.39	3783.2	3.47	0.0100899	keep
Jun 2019	12024.08	9.394667	7.93	0.37	2915.0	2.38	3824.4	3.47	0.0462735	keep
Jul 2019	9572.74	9.166675	9.01	0.32	3061.0	2.40	3856.8	3.47	0.0107447	keep
Aug 2019	9590.74	9.168553	10.18	0.25	3271.0	2.13	3858.9	3.47	0.0055603	keep
Sep 2019	8085.71	8.997854	12.76	0.16	3332.0	2.04	3900.0	3.47	0.0016219	keep
Oct 2019	9226.00	9.129781	13.69	0.16	3508.0	1.83	3935.3	3.47	0.0000842	keep
Nov 2019	7729.27	8.952770	12.97	0.14	3752.0	1.55	3955.3	3.47	0.0015848	keep
Dec 2019	7251.28	8.888933	12.95	0.14	3943.0	1.55	4007.1	3.47	0.0041594	keep
Jan 2020	9545.08	9.163781	15.47	0.15	4199.0	1.55	3993.6	3.18	0.0053137	keep
Feb 2020	8778.47	9.080057	15.49	0.13	4548.0	1.58	4002.9	3.18	0.0008524	keep
Mar 2020	6483.74	8.777053	13.91	0.12	4859.0	0.65	4261.9	3.18	0.0008725	keep
Apr 2020	8773.11	9.079447	15.96	0.15	5222.0	0.05	4779.8	3.18	0.0262619	keep
May 2020	9688.32	9.178676	15.14	0.08	5535.0	0.05	16232.9	3.18	0.0000022	keep
Jun 2020	9188.06	9.125660	15.78	0.07	5863.0	0.08	16563.5	3.18	0.0035458	keep
Jul 2020	11118.92	9.316403	16.85	0.11	6320.0	0.09	16763.8	3.18	0.0004197	keep
Aug 2020	11657.00	9.363662	17.56	0.09	6692.0	0.10	16896.7	3.18	0.0004485	keep
Sep 2020	10764.28	9.283988	19.31	0.08	7396.0	0.09	17161.5	3.18	0.0192599	keep
Oct 2020	13573.71	9.515890	20.00	0.13	8323.0	0.09	17365.3	3.18	0.0049156	keep
Nov 2020	18114.41	9.804463	19.16	0.12	9096.0	0.09	17626.7	3.18	0.0159129	keep
Dec 2020	28768.84	10.267048	18.60	0.22	26722.7	0.09	17834.5	3.18	0.1374381	delete
Jan 2021	34622.37	10.452255	20.82	0.24	44349.4	0.09	18107.1	4.35	0.0541379	keep
Feb 2021	46642.61	10.750270	21.72	0.27	61976.1	0.08	18367.8	4.35	0.0018537	keep
Mar 2021	58734.48	10.980782	21.87	0.39	79602.8	0.07	18641.4	4.35	0.0024732	keep
Apr 2021	53260.30	10.882946	23.58	0.36	97229.5	0.07	18927.7	4.35	0.0083401	keep
May 2021	35749.66	10.484296	21.05	0.23	114856.2	0.06	19259.0	4.35	0.0125123	keep
Jun 2021	35945.79	10.489767	19.93	0.24	132482.9	0.08	19318.8	4.35	0.0045121	keep
Jul 2021	41157.15	10.625153	14.49	0.36	150109.6	0.10	19497.1	4.35	0.0412099	keep
Aug 2021	47663.02	10.771911	17.62	0.35	167736.3	0.09	19746.0	4.35	0.0134071	keep
Sep 2021	41412.61	10.631341	18.99	0.34	185363.0	0.08	19898.4	4.35	0.0001254	keep
Oct 2021	61374.28	11.024746	21.66	0.36	202989.7	0.08	20061.8	4.35	0.0081369	keep
Nov 2021	57834.36	10.965338	22.34	0.30	220616.4	0.08	20273.9	4.35	0.0025818	keep
Dec 2021	47128.47	10.760633	24.27	0.29	238243.1	0.08	20423.6	4.35	0.0856546	delete

We find three identified outliers in the added outlier column.

Next, we will delete the outliers from the dataset.

#redundant check of outlier count
outliers <- df_BTC_cooksd == 'delete'
sum(outliers)

## [1] 3

#remove the three outliers from the dataframe
BTC2 <- subset(BTC, outlier != "delete")

#verify that there are 52 observations remaining
nrow(BTC2)

## [1] 52

#Delete extraneous columns that resulted from the Cook's Distance formula
BTC3 <- BTC2[c(-10,-11)]

This is our complete dataset in all it’s glory.

BTC3 %>% 
  kbl() %>% 
  kable_styling()

Month-Year	BTC.Price	BTC.Price.Log	BTC.Mining.Difficulty	BTC.Mining.Profitability	BTC.Atm.Count.US	Fed.Fund.Rate	M1.Money.Supply	Global.Inflation.Rate
Jul 2017	2718.62	7.907880	0.86	0.88	826.0	1.15	3548.6	3.23
Aug 2017	4676.14	8.450228	0.89	1.62	853.0	1.16	3587.8	3.23
Sep 2017	4207.83	8.344702	1.10	0.84	908.0	1.15	3571.2	3.23
Oct 2017	6130.56	8.721041	1.45	1.11	948.0	1.15	3606.3	3.23
Nov 2017	10542.66	9.263185	1.35	2.08	1052.0	1.16	3631.0	3.23
Dec 2017	13062.15	9.477474	1.87	2.28	1125.0	1.30	3610.6	3.23
Jan 2018	9914.47	9.201751	2.60	0.91	1216.0	1.41	3656.2	3.59
Feb 2018	10903.02	9.296795	3.01	0.85	1319.0	1.42	3619.0	3.59
Mar 2018	6970.36	8.849422	3.46	0.46	1475.0	1.51	3663.4	3.59
Apr 2018	9268.72	9.134401	4.02	0.62	1624.0	1.69	3655.2	3.59
May 2018	7363.28	8.904261	4.31	0.44	1752.0	1.70	3650.8	3.59
Jun 2018	6189.31	8.730579	5.08	0.31	1880.0	1.82	3652.2	3.59
Jul 2018	8166.39	9.007782	5.95	0.34	1982.0	1.91	3677.5	3.59
Aug 2018	6938.91	8.844900	6.73	0.28	1988.0	1.91	3693.2	3.59
Sep 2018	6576.61	8.791275	7.15	0.25	2066.0	1.95	3701.8	3.59
Oct 2018	6276.46	8.744561	7.18	0.23	2119.0	2.19	3728.2	3.59
Nov 2018	3952.45	8.282091	6.65	0.15	2183.0	2.20	3705.6	3.59
Dec 2018	3689.56	8.213263	5.62	0.17	2248.0	2.27	3763.4	3.59
Jan 2019	3441.03	8.143526	5.81	0.17	2301.0	2.40	3754.8	3.47
Feb 2019	3772.94	8.235610	6.07	0.16	2428.0	2.40	3762.9	3.47
Mar 2019	4081.22	8.314151	6.38	0.17	2487.0	2.41	3733.6	3.47
Apr 2019	5151.43	8.547030	6.35	0.20	2602.0	2.42	3775.0	3.47
May 2019	8287.04	9.022448	7.46	0.27	2760.0	2.39	3783.2	3.47
Jun 2019	12024.08	9.394667	7.93	0.37	2915.0	2.38	3824.4	3.47
Jul 2019	9572.74	9.166675	9.01	0.32	3061.0	2.40	3856.8	3.47
Aug 2019	9590.74	9.168553	10.18	0.25	3271.0	2.13	3858.9	3.47
Sep 2019	8085.71	8.997854	12.76	0.16	3332.0	2.04	3900.0	3.47
Oct 2019	9226.00	9.129781	13.69	0.16	3508.0	1.83	3935.3	3.47
Nov 2019	7729.27	8.952770	12.97	0.14	3752.0	1.55	3955.3	3.47
Dec 2019	7251.28	8.888933	12.95	0.14	3943.0	1.55	4007.1	3.47
Jan 2020	9545.08	9.163781	15.47	0.15	4199.0	1.55	3993.6	3.18
Feb 2020	8778.47	9.080057	15.49	0.13	4548.0	1.58	4002.9	3.18
Mar 2020	6483.74	8.777053	13.91	0.12	4859.0	0.65	4261.9	3.18
Apr 2020	8773.11	9.079447	15.96	0.15	5222.0	0.05	4779.8	3.18
May 2020	9688.32	9.178676	15.14	0.08	5535.0	0.05	16232.9	3.18
Jun 2020	9188.06	9.125660	15.78	0.07	5863.0	0.08	16563.5	3.18
Jul 2020	11118.92	9.316403	16.85	0.11	6320.0	0.09	16763.8	3.18
Aug 2020	11657.00	9.363662	17.56	0.09	6692.0	0.10	16896.7	3.18
Sep 2020	10764.28	9.283988	19.31	0.08	7396.0	0.09	17161.5	3.18
Oct 2020	13573.71	9.515890	20.00	0.13	8323.0	0.09	17365.3	3.18
Nov 2020	18114.41	9.804463	19.16	0.12	9096.0	0.09	17626.7	3.18
Jan 2021	34622.37	10.452255	20.82	0.24	44349.4	0.09	18107.1	4.35
Feb 2021	46642.61	10.750270	21.72	0.27	61976.1	0.08	18367.8	4.35
Mar 2021	58734.48	10.980782	21.87	0.39	79602.8	0.07	18641.4	4.35
Apr 2021	53260.30	10.882946	23.58	0.36	97229.5	0.07	18927.7	4.35
May 2021	35749.66	10.484296	21.05	0.23	114856.2	0.06	19259.0	4.35
Jun 2021	35945.79	10.489767	19.93	0.24	132482.9	0.08	19318.8	4.35
Jul 2021	41157.15	10.625153	14.49	0.36	150109.6	0.10	19497.1	4.35
Aug 2021	47663.02	10.771911	17.62	0.35	167736.3	0.09	19746.0	4.35
Sep 2021	41412.61	10.631341	18.99	0.34	185363.0	0.08	19898.4	4.35
Oct 2021	61374.28	11.024746	21.66	0.36	202989.7	0.08	20061.8	4.35
Nov 2021	57834.36	10.965338	22.34	0.30	220616.4	0.08	20273.9	4.35

We are now ready to to make our Training and Test sets.

Section 2: Training and Test Sets

Creating training and holdout sets is mandatory to ensure the model is fits the data well.
We use the training set to develop the model, then we will apply the model to the test set in order to determine the model’s efficacy.¹⁰

Create the training and test sets.

#make the procedure reproducible
set.seed(1)

#use 70% of dataset as training set and 30% as test set
sample <- sample(c(TRUE, FALSE), nrow(BTC3), replace=TRUE, prob=c(0.7,0.3))
train  <- BTC3[sample, ]
test   <- BTC3[!sample, ]

We will return to the test dataset after we build our model with the training data first.

We confirm that the spread is 70% allocated to the training set and 30% allocated to the test set.

nrow(train)

## [1] 35

nrow(test)

## [1] 17

#Training set
35/52

## [1] 0.6730769

#Test Set
17/52

## [1] 0.3269231

Close enough. Now, let’s test for the data’s integrity.

Section 3: Multicollinearity & Variance Inflation Factor

Multicollinearity is the phenomena where multiple independent variables are correlated, or influence each other.¹¹ This is a problem because predictor variables should have no correlation or as little correlation as possible. We will test for this.

#subset the data only for the `independent variables`
train2 <- train[c(-1,-2, -3)]
par(mfrow=c(1,1))

#create a numeric vector
datamatrix <- cor(train2)

#make the correlation matrix
corrplot(datamatrix, method ='number')

On a cursory level, BTC.Mining.Difficulty, M1.Money.Supply, BTC.Atm.Count.US and Global.Inflation.Rate seem to have moderate to high correlation between other variables. Should we delete any of these predictor variables at this point?

To determine which variables to scrap, if needed, we employ Variance Inflation Factor (VIF) for augmented analysis.¹²

#be sure to revert to the original training set to reincorporate the independent variable into the model 
VIF<-vif(lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + BTC.Atm.Count.US 
            + Fed.Fund.Rate + M1.Money.Supply + Global.Inflation.Rate, data = train))
VIF %>% 
  kbl() %>% 
  kable_styling()

	x
BTC.Mining.Difficulty	5.160656
BTC.Mining.Profitability	2.311268
BTC.Atm.Count.US	4.805880
Fed.Fund.Rate	4.974582
M1.Money.Supply	8.509970
Global.Inflation.Rate	3.940872

The VIF should be less than 10. A VIF greater than 10 indicates severe multicollinearity. Since all variables are under 10 (for some barely), we will retain all variables at this moment.¹³

On to the good stuff, let’s build the linear regression model!

Section 4: Modeling

Calculate the linear regression model.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + BTC.Atm.Count.US 
          + Fed.Fund.Rate + M1.Money.Supply + Global.Inflation.Rate, data = train)
summary(model)

## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + 
##     BTC.Atm.Count.US + Fed.Fund.Rate + M1.Money.Supply + Global.Inflation.Rate, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54677 -0.14387  0.03012  0.13608  0.42740 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.714e+00  6.619e-01   7.122 9.49e-08 ***
## BTC.Mining.Difficulty    7.520e-02  1.358e-02   5.538 6.38e-06 ***
## BTC.Mining.Profitability 5.905e-01  1.512e-01   3.906  0.00054 ***
## BTC.Atm.Count.US         6.591e-07  1.634e-06   0.403  0.68974    
## Fed.Fund.Rate            1.090e-01  1.052e-01   1.037  0.30864    
## M1.Money.Supply          1.757e-05  1.795e-05   0.979  0.33619    
## Global.Inflation.Rate    8.961e-01  1.993e-01   4.496  0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2432 on 28 degrees of freedom
## Multiple R-squared:  0.9256, Adjusted R-squared:  0.9096 
## F-statistic: 58.02 on 6 and 28 DF,  p-value: 1.667e-14

Since the BTC.Atm.Count.US is the greatest non-significant variable (with a p-value > 0.05), we will delete this variable.

With the deleted predictor variable, we run the process again.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + Fed.Fund.Rate 
          + M1.Money.Supply + Global.Inflation.Rate, data = train)
summary(model)

## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + 
##     Fed.Fund.Rate + M1.Money.Supply + Global.Inflation.Rate, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5430 -0.1556  0.0394  0.1362  0.4288 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.531e+00  4.745e-01   9.549 1.86e-10 ***
## BTC.Mining.Difficulty    7.566e-02  1.333e-02   5.674 3.91e-06 ***
## BTC.Mining.Profitability 5.987e-01  1.476e-01   4.056 0.000344 ***
## Fed.Fund.Rate            1.057e-01  1.033e-01   1.023 0.314770    
## M1.Money.Supply          1.905e-05  1.732e-05   1.100 0.280297    
## Global.Inflation.Rate    9.478e-01  1.504e-01   6.303 6.97e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2396 on 29 degrees of freedom
## Multiple R-squared:  0.9251, Adjusted R-squared:  0.9122 
## F-statistic: 71.66 on 5 and 29 DF,  p-value: 2.007e-15

Since the Federal.Fund.Rate is the next largest non-significant variable (with a p-value > 0.05), we will delete this variable.

With the deleted predictor variable, we run the process again.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + M1.Money.Supply 
          + Global.Inflation.Rate, data = train)
summary(model)

## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + 
##     M1.Money.Supply + Global.Inflation.Rate, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51257 -0.13521  0.02273  0.14103  0.49358 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.598e+00  4.703e-01   9.777 7.71e-11 ***
## BTC.Mining.Difficulty    7.148e-02  1.270e-02   5.627 3.97e-06 ***
## BTC.Mining.Profitability 5.288e-01  1.310e-01   4.038 0.000344 ***
## M1.Money.Supply          7.292e-06  1.296e-05   0.563 0.577871    
## Global.Inflation.Rate    1.012e+00  1.367e-01   7.406 2.98e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2398 on 30 degrees of freedom
## Multiple R-squared:  0.9224, Adjusted R-squared:  0.9121 
## F-statistic: 89.18 on 4 and 30 DF,  p-value: 3.29e-16

Since the M1.Money.Supply is the next least significant variable (with a p-value > 0.05), we will delete this variable.

With the deleted predictor variable, we run the process again.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability 
          + Global.Inflation.Rate, data = train)
summary(model)

## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability + 
##     Global.Inflation.Rate, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4913 -0.1487  0.0283  0.1471  0.4853 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.443317   0.376956  11.787 5.50e-13 ***
## BTC.Mining.Difficulty    0.076886   0.008222   9.351 1.55e-10 ***
## BTC.Mining.Profitability 0.557522   0.119266   4.675 5.45e-05 ***
## Global.Inflation.Rate    1.051885   0.115775   9.086 3.01e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2372 on 31 degrees of freedom
## Multiple R-squared:  0.9216, Adjusted R-squared:  0.914 
## F-statistic: 121.5 on 3 and 31 DF,  p-value: < 2.2e-16

Executive Decision: I am not convinced that the Global.Inflation.Rate variable is a good metric in this particular instance. The reason being: I was not able to capture a month-to-month assessment in keeping with the data; instead, I scraped a blanket average for each year, given the resources at hand.¹⁴

Global Inflation Rate may fluctuate month-to-month. While it may have an attractive p-value for this model, the variable is not trustworthy due to my lack of capturing better granualarity, inconsistent with the other remaining predictor variables.

Thus, due to how I perceive the Global Inflation Rate interacts with the data, I am going to delete this variable, leaving only BTC.Mining.Difficulty and BTC.Mining.Profitability as the star predictor variables.

model<-lm(BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability, data = train)
summary(model)

## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62588 -0.42858  0.09613  0.35924  0.96722 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               7.69717    0.22158  34.738  < 2e-16 ***
## BTC.Mining.Difficulty     0.11588    0.01321   8.771 5.06e-10 ***
## BTC.Mining.Profitability  0.78244    0.21977   3.560  0.00118 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4468 on 32 degrees of freedom
## Multiple R-squared:  0.7129, Adjusted R-squared:  0.6949 
## F-statistic: 39.72 on 2 and 32 DF,  p-value: 2.136e-09

print(model)

## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability, 
##     data = train)
## 
## Coefficients:
##              (Intercept)     BTC.Mining.Difficulty  BTC.Mining.Profitability  
##                   7.6972                    0.1159                    0.7824

Thus, the significant independent variables that we will retain are: BTC.Mining.Difficulty and BTC.Mining.Profitability.

${R^2}$ is 0.7129, which means that the model created explains 71.29% of the variability in the given test data set.

Our model follows the form $Y_{0}$ = $\beta_{0}$ + $\beta_{1}$$X_{i}$ + $\beta_{2}$$X_{i}$ + $\varepsilon_{i}$

Which transforms into:

BTC.Price.Log = 7.69 + 0.12$X_{BTC.Mining.Difficulty}$ + 0.78$X_{BTC.Mining.Profitability}$

7.69 is the equation constant intercept.

A 0.12 increase in BTC.Price.Log is associated with a BTC.Mining.Difficulty increase by a factor of 1.

A 0.78 increase in BTC.Price.Log is associated with a BTC.Mining.Profitability increase by a factor of 1.

Section 5: Testing Assumptions

Check for homoscedasticity

#visual look
par(mfrow=c(1,1))
plot(model)

At a cursory glance, although there is not very much data provided in the training set (35 observations), the data seems to meet the assumption, with a seemingly constant scatter through the residual output.

However, just to be sure, verify with the Breusch-Pagan test

lmtest::bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 1.4359, df = 2, p-value = 0.4878

Since the null hypothesis of the Breusch-Pagan test is that heteroscedastiy is present, and the p value > 0.05 (not significant), we reject the null hypothesis and determine that the model’s residuals are homoscedastic.

Check for normal distribution

ols_plot_resid_hist(model)

Most of the residuals fall around zero, and the number of observations in the tails of the histogram is low, therefore the regression model is fairly normally distributed. For OCD types wanting a picture perfect bell-shaped curve spread, calm down, looks can be deceiving.

Let’s implement a Moneyball approach with some hard math.

ols_test_normality(model)

## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9379         0.0482 
## Kolmogorov-Smirnov         0.12          0.6507 
## Cramer-von Mises          4.0129         0.0000 
## Anderson-Darling           0.77          0.0411 
## -----------------------------------------------

Since the number of observations in our test dataset are less than 50, we use Shapiro-Wilk test. Shapiro-Wilk shows that the p-value is less than 0.05, meaning we reject the null hypothesis that the residuals are not normally distributed, and rather that the residuals are indeed normally distributed. This model is normally distributed (by a hair), but that’s good for our purposes. It gets on base.

Assess multicollinearity again.

remaining_variables <- train[c(4,5)]
par(mfrow=c(1,1))
datamatrix <- cor(remaining_variables)
corrplot(datamatrix, method ='number')

We confirm no outstanding multicollinearity with the remaining dependent variables in the model.

Additionally, they are moderately negatively correlated. A decrease in mining difficulty indicates an increase in mining profitability, and an increase in mining difficulty indicates a decrease in mining profitability.

Independence of Errors: the residuals indicate that the noise terms are independent (possessing no autocorrelation, and confirmed with the multicollinearity test). Furthermore, the noise terms are independent of the explanatory variables and there is no obvious string of positive to negative residuals, or any other definable pattern.

However, we are not finished yet.

“I gotta lotta mo!”
— Mr. T, Rocky III

Section 6: Testing Root Means Squared Predicted Error (RMSPE)

RMSPE determines how well a model fits the hold-out data. This is where we will apply the training set model on the test dataset, to validate where or not it fits the data.¹⁵

Use thepredict function with the training set-developed model.

p <- predict(model, newdata = test)
error <- test$BTC.Price.Log - p
RMSPE <- sqrt(mean(error^2))
RMSPE

## [1] 0.3822073

Recall that the standard error of the model is ~0.44 (reference Residual standard error in output).

summary(model)

## 
## Call:
## lm(formula = BTC.Price.Log ~ BTC.Mining.Difficulty + BTC.Mining.Profitability, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62588 -0.42858  0.09613  0.35924  0.96722 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               7.69717    0.22158  34.738  < 2e-16 ***
## BTC.Mining.Difficulty     0.11588    0.01321   8.771 5.06e-10 ***
## BTC.Mining.Profitability  0.78244    0.21977   3.560  0.00118 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4468 on 32 degrees of freedom
## Multiple R-squared:  0.7129, Adjusted R-squared:  0.6949 
## F-statistic: 39.72 on 2 and 32 DF,  p-value: 2.136e-09

Since RMSPE of 0.38 is smaller than the Residual Standard error of 0.44, we determine that the training set model predicts the holdout set better than the training (validation) set, which is great news.

We’re good to hook and make a prediction on January 2022's Bitcoin Price.

Section 7: Predicting Bitcoin Price in January 2022

From Section 4: Modeling, our stellar model equation is:

BTC.Price.Log = 7.69 + 0.12$X_{BTC.Mining.Difficulty}$ + 0.78$X_{BTC.Mining.Profitability}$

In January 2022, the BTC.Mining.Difficulty was ~25.17 and the BTC.Mining.Profitability was ~0.19.

Calculate the price prediction, remembering to take the inverse log function of the result to convert it to Bitcoin in USD, as we transformed the original data logarithmically in Section 1: Data Cleaning.

c <- 7.69 + (.12*25.17)+(0.78*0.19)
exp(c) #take inverse log function, since we transformed the original data logarithmically during the data cleaning phase

## [1] 51979.26

Calculate the 95% confidence interval.¹⁶

data(train)
sample.mean <- mean(train$BTC.Price.Log)
sample.n <- length(train$BTC.Price.Log)
sample.sd <- sd(train$BTC.Price.Log)
sample.se <- sample.sd/sqrt(sample.n)
alpha = 0.05
degrees.freedom = sample.n - 1
t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
margin.error <- t.score * sample.se
lower.bound <- sample.mean - margin.error
upper.bound <- sample.mean + margin.error
print(c(lower.bound,upper.bound))

## [1] 9.011596 9.567301

Take the reverse log to re-transform the interval into USD.

exp(9.011596)

## [1] 8197.594

exp(9.567301)

## [1] 14289.8

The model predicts Bitcoin price at $51,979.26 in January 2022, with a 95% confidence interval of [$8,197.59, $14,289.80]. Again, ${R^2}$ is 0.7129 and Standard Error is 0.4468.

Conclusion

The actual price of Bitcoin in January 2022 was between $37,188.11 to $38,483.13.¹⁷ Intuitively, given the volatility of Bitcoin in the last year, a difference between the model’s prediction and the actual price, of around $13,496.13 to $14,791.15. Additionally, the 95% confidence interval complements this range. The model seems to assume a sensible range for extrapolating future near-term Bitcoin price.

However, as the model is based on 35 observations, it is ill-equipped for any serious forecasting. See the Bitcoin Price Forecasting (Time-Series) model in the footnotes.¹⁸ Given the data, the analysis backing this model is stronger at identifying which potential explanatory variables are statistically significant, along with those that are frivolous. Yet, for its purpose, this model has offered some initiial insight on the actual hard factors that correlate to its price.

Deleted variables

These variables were statistically insignificant in correlating to Bitcoin’s price: BTC.ATM.Count.US, M1 Money Supply, and Federal Funds Rate. They all seemed to be good economic barometers, or similar, that might influence Bitcoin value. However, in context, Bitcoin operates in its own cryptographic ecosystem, as clearly shown by the Mining Difficulty and Mining Profitability variables making the final cut.

BTC ATMs in the US
The increase of BTC ATM supply nationwide does not necessarily assume that retail consumers or any other parties are using them to buy more Bitcoin, or influence its valuation in any way. It might be interesting to conduct a sample survey in most metropolitan areas and suburbs as to how often BTC ATMs (or BTMs) are actually used.

M1 Money Supply

One plausible explanation for this variable’s insignificance is due to the fact that fiat is not the only method for procuring Bitcoin. Bitcoin can be mined and swapped back and forth from other currencies (albeit through expensive marketplaces). Additionally, the size of M1 does not imply at any time that a significant portion of the supply is used to purchase Bitcoin.

Federal Funds Rate

The Federal Funds Rate is often hyped by media as a potential predictor on Bitcoin bear and bull markets, as of late. Yet, it does not appear to be in any way significant for this particular model. Future case studies involving a more granular capture of the Federal Funds Rate and its impact on Bitcoin might offer more insight. However, as the Federal Funds Rate to Bitcoin Price relation has not been clearly identified (statistically) today, one may assume that the Federal Funds Rate has little-to-no impact on Bitcoin’s price fluctuation in this model.

Global Inflation Rate

It would be prudent to revisit this variable. The Global Inflation Rate appeared to have significance in correlation to Bitcoin’s price; however, I did not trust the source data as it did not align with the other independent variables, and for that reason removed it preemptively.

A look at the remaining predictor variables

scatterplot3d(BTC3$BTC.Mining.Profitability, 
              BTC3$BTC.Mining.Difficulty, 
              BTC3$BTC.Price, 
              pch = 19, color = "black",
              xlab="Mining Profitability", ylab="Mining Difficulty", 
              zlab="Bitcoin Price",
              highlight.3d = TRUE, angle=45,
              trees, type="h",
              main="BTC Price against Mining Profitabilty and Mining Difficulty", 
              scale.y=0.7)

This figure isolates all 52 observation variables. The dataset clearly demonstrates the effect that Bitcoin Mining Difficulty has on Bitcoin Price. A higher Bitcoin Difficulty (difficulty in mining a Bitcoin block, or for dedicated hardware to solve a mathematical or algorithmic hash below a given target level, considering the amount of other miners attempting the same task), correlates with a higher Bitcoin Price. There may be a level of cognitive dissonance associated with recognizing that Mining Profitability appears to take a secondary placement (visually) in determining Bitcoin Price. Within the confines of this analysis, remember that the final multicollinearity shows Mining Profitability and Mining Difficulty being inversely correlated. Contextually, the more miners employed in mining a Bitcoin block, the less profitable it is for one single miner (or cluster of miners) to earn sizable revenue (hence, the popularity of mining pools).

Adjusting for Seasonality

Keep in mind that this analysis started with 55 observations and the model was based on 35, a very small amount of data. The adage of most data scientists and analysts is to get more (relevant) data. A greater quantity of observations, contingent on the accommodating date ranges of the other explanatory variables, would allow room for lagging the economic variables to adjust for inertia and seasonality. This would reinforce the model, and perhaps expose different patterns and behavior, in future iterations.

Ultimately, while domestic and global external macroeconomic factors influencing retail and institutional investor habits may affect Bitcoin price secondarily, the intrinsic factors identified only seem to have the most statistical signifiance, in determining Bitcoin valuation.

For any questions or comments on this analysis please contact kris.kilgroe@sigmacode.io.

Session Info

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 20.3
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4     lmtest_0.9-40        zoo_1.8-10          
##  [4] olsrr_0.5.3          scatterplot3d_0.3-41 car_3.1-0           
##  [7] carData_3.0-5        corrplot_0.92        broom_0.8.0         
## [10] caret_6.0-92         lattice_0.20-45      forcats_0.5.1       
## [13] stringr_1.4.0        dplyr_1.0.9          purrr_0.3.4         
## [16] readr_2.1.2          tidyr_1.2.0          tibble_3.1.7        
## [19] ggplot2_3.3.6        tidyverse_1.3.1      readxl_1.4.0        
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-3     ellipsis_0.3.2       class_7.3-20        
##  [4] fs_1.5.2             rstudioapi_0.13      farver_2.1.0        
##  [7] listenv_0.8.0        prodlim_2019.11.13   fansi_1.0.3         
## [10] lubridate_1.8.0      xml2_1.3.3           codetools_0.2-18    
## [13] splines_4.2.1        knitr_1.39           jsonlite_1.8.0      
## [16] pROC_1.18.0          dbplyr_2.2.0         compiler_4.2.1      
## [19] httr_1.4.3           backports_1.4.1      assertthat_0.2.1    
## [22] Matrix_1.4-1         fastmap_1.1.0        tufte_0.12          
## [25] cli_3.3.0            htmltools_0.5.2      tools_4.2.1         
## [28] gtable_0.3.0         glue_1.6.2           reshape2_1.4.4      
## [31] Rcpp_1.0.8.3         cellranger_1.1.0     jquerylib_0.1.4     
## [34] vctrs_0.4.1          svglite_2.1.0        nlme_3.1-157        
## [37] iterators_1.0.14     timeDate_3043.102    gower_1.0.0         
## [40] xfun_0.31            globals_0.15.0       rvest_1.0.2         
## [43] lifecycle_1.0.1      goftest_1.2-3        future_1.26.1       
## [46] MASS_7.3-57          scales_1.2.0         ipred_0.9-13        
## [49] hms_1.1.1            parallel_4.2.1       yaml_2.3.5          
## [52] gridExtra_2.3        sass_0.4.1           rpart_4.1.16        
## [55] stringi_1.7.6        highr_0.9            foreach_1.5.2       
## [58] nortest_1.0-4        hardhat_1.1.0        lava_1.6.10         
## [61] rlang_1.0.2          pkgconfig_2.0.3      systemfonts_1.0.4   
## [64] evaluate_0.15        labeling_0.4.2       recipes_0.2.0       
## [67] tidyselect_1.1.2     parallelly_1.32.0    plyr_1.8.7          
## [70] magrittr_2.0.3       bookdown_0.27        R6_2.5.1            
## [73] generics_0.1.2       DBI_1.1.3            pillar_1.7.0        
## [76] haven_2.5.0          withr_2.5.0          survival_3.2-13     
## [79] abind_1.4-5          nnet_7.3-17          future.apply_1.9.0  
## [82] modelr_0.1.8         crayon_1.5.1         utf8_1.2.2          
## [85] tzdb_0.3.0           rmarkdown_2.14       grid_4.2.1          
## [88] data.table_1.14.2    rmdformats_1.0.4     ModelMetrics_1.2.2.2
## [91] webshot_0.5.3        reprex_2.0.1         digest_0.6.29       
## [94] stats4_4.2.1         munsell_0.5.0        viridisLite_0.4.0   
## [97] bslib_0.3.1

Footnotes

Link to BTC Price dataset: https://docs.google.com/spreadsheets/d/10GtR7HMJutJqx54iJ4SMSoq1cPnl8pTv/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true ↩︎
Link to BTC Mining Difficulty dataset: https://docs.google.com/spreadsheets/d/1yuuzSyCeDlxePXh10_fVgtpxVC_6sChe/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true ↩︎
Link to BTC.Mining.Profitability dataset: https://docs.google.com/spreadsheets/d/1idsNZTugrcouyY-HGgITjXmjgIuOt26O/edit?usp=sharing&rtpof=true&sd=true ↩︎
Link to BTC ATM Count US dataset: https://docs.google.com/spreadsheets/d/1XEdA-5PBidwvEAQbSvuCm9-v47QHfU-U/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true ↩︎
Link to Federal Funds Rate dataset: https://docs.google.com/spreadsheets/d/13rb6HKnxUBTr1xInNGG-dKbTgMSaWDvL/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true ↩︎
Link to M1 Money supply dataset: https://docs.google.com/spreadsheets/d/1ofoa4V2hxwjJLhcRyR8J_mnfAcB1DIZ8/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true ↩︎
Link to Global Inflation Rate: https://docs.google.com/spreadsheets/d/1nMocnrsLW6IffI528BJYKy11mjfXKpgm/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true ↩︎
Link to custom dataset: https://docs.google.com/spreadsheets/d/1iIK7zBqW9gqbwSmxVG5J9P1kIiw3vam-/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true ↩︎
Link to Cook's Distance explanation: https://towardsdatascience.com/identifying-outliers-in-linear-regression-cooks-distance-9e212e9136a ↩︎
Link to Test and Validation Datasets: https://machinelearningmastery.com/difference-test-validation-datasets/↩︎
Link to Multicollinearity: https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/multicollinearity.html ↩︎
VIF defined here: https://online.stat.psu.edu/stat462/node/180/↩︎
For an argument on what determined a “good” VIF level: https://www.researchgate.net/post/Multicollinearity_issues_is_a_value_less_than_10_acceptable_for_VIF ↩︎
Link to Global Inflation Rate: https://docs.google.com/spreadsheets/d/1nMocnrsLW6IffI528BJYKy11mjfXKpgm/edit?usp=sharing&ouid=116278907272817573735&rtpof=true&sd=true ↩︎
Explanation on RMSPE at: https://www.sciencedirect.com/topics/engineering/root-mean-squared-error ↩︎
Calculating Confidence Intervals in R: https://bookdown.org/logan_kelly/r_practice/p09.html ↩︎
Bitcoin Price for January 2022: https://infomediang.com/`Bitcoin`-price-for-january-2022/↩︎
See Bitcoin Price Forecasting (Time-Series) using the Prophet package at: https://rpubs.com/KrisKilgroe/BTC_price_forecasting ↩︎