Bitcoin Price Model Analysis - Final - Group 2

Contributors - Ms. Farlin, Mr. Zinnun, Mr. Arafat, Mr. Irfan, Mr. Towhid

1 Objective of the Project

The main purpose of this project is to apply regression and time series analysis to a real-world problem. We fit regression and time series models to the historical Bitcoin price and use the selected model for forecasting.

2 Dataset

We are given a dataset of the historical Bitcoin (BTC) price against USD. It contains monthly average prices from January 2015 to November 2023 (107 observations).

3 Applied Models

The following models are applied in this project:

Linear regression model
Quadratic regression model
ARIMA model

4 Task to Perform

4.1 Loading the Dataset & Necessary Libraries:

The dataset is loaded and checked for data types and missing values. The dataset was found to have no missing values.

# Load necessary libraries
library(tidyverse)
library(lubridate)
library(forecast)
library(tseries)
library(kableExtra)
library(dplyr)

## Load the dataset
BitCoin <- read.csv("E:/BTC_Monthly_grp2.csv")
#BitCoin <- read.csv("C:\\Users\\hp\\Downloads\\BTC-Monthly.csv")



# Check the data types of the features
str(BitCoin)
'data.frame':   107 obs. of  2 variables:
 $ Date : chr  "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" ...
 $ Price: num  217 254 244 236 230 ...


# Assign appropriate data type to features
BitCoin$Date <- as.Date(BitCoin$Date, format = "%Y-%m-%d")
names(BitCoin)[names(BitCoin) == "Close"] <- "Price"
# Check the structure of the data frame
str(BitCoin)
'data.frame':   107 obs. of  2 variables:
 $ Date : Date, format: "2015-01-01" "2015-02-01" ...
 $ Price: num  217 254 244 236 230 ...


# Check if there’s any missing value
sum(is.na(BitCoin))
[1] 0

4.2 Descriptive Analytics:

# Copy the BitCoin data frame to a new data frame named BitCoin_df
BitCoin_df <- BitCoin

# Create two more columns 'month' & 'year' by populating with the months & years values from the 'Date' column
BitCoin_df$month <- format(BitCoin_df$Date, "%m")
BitCoin_df$year <- format(BitCoin_df$Date, "%Y")

# Create a monthly boxplot of prices
library(ggplot2)
ggplot(BitCoin_df, aes(x = month, y = Price, fill = month)) + 
  geom_boxplot() + 
  theme_minimal() + 
  ggtitle("Monthly Boxplot of Bitcoin Prices")


# Create a yearly boxplot of prices
ggplot(BitCoin_df, aes(x = year, y = Price, fill = year)) + 
  geom_boxplot() + 
  theme_minimal() + 
  ggtitle("Yearly Boxplot of Bitcoin Prices")


# Create year wise trend lines of prices
ggplot(BitCoin_df, aes(x = Date, y = Price, color = year)) + 
  geom_line() + 
  theme_minimal() + 
  ggtitle("Year-wise Trend Lines of Bitcoin Prices")


# Convert the BitCoin data frame to a time series object with frequency 1
library(zoo)
btc_ts <- zoo(BitCoin$Price, order.by = BitCoin$Date)

# Plot the time series of monthly prices on years
plot(btc_ts, ylab = "Monthly BTC Price", xlab = "Time", type = "o", col = "blue", main = "Time Series of Monthly Bitcoin Prices")


# Find the relationship between consecutive months. Show the correlation through a scatter plot
cor(BitCoin_df$Price[-1], BitCoin_df$Price[-nrow(BitCoin_df)])
[1] 0.9617764

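# Pair each month with the previous month's price so aes() can refer to both columns by name
lag_df <- data.frame(Price = BitCoin_df$Price[-1], PrevPrice = BitCoin_df$Price[-nrow(BitCoin_df)])
ggplot(lag_df, aes(x = PrevPrice, y = Price)) + 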
  geom_point() +
  labs(x = "Price of Previous Month", y = "Monthly BTC Price") +
  theme_minimal() + 
  ggtitle("Correlation between Consecutive Months")

Monthly Boxplot of Bitcoin Prices

The monthly boxplot shows the distribution of Bitcoin prices by calendar month, which helps reveal any seasonality or monthly pattern. Wide boxes or long whiskers in a given month indicate high volatility in that month.

Yearly Boxplot of Bitcoin Prices

The yearly boxplot illustrates how Bitcoin prices vary across different years. It helps in understanding the long-term trends and the extent of price variations over the years.

Year-wise Trend Lines of Bitcoin Prices

This plot provides a visual representation of the price trends over time. It helps in identifying overall growth patterns, significant peaks, and drops.

Correlation between Consecutive Months

A high correlation (0.9618) between consecutive months suggests that Bitcoin prices are highly dependent on the prices in the previous month, indicating a strong autocorrelation in the data.
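The lag-1 relationship described above can be sketched on a small series. The vector `price` below holds made-up values, not the project data; the point is only that pairing each value with its predecessor and calling cor() gives the consecutive-month correlation, while acf() reports a closely related quantity.

```r
# Illustrative series (made-up values, not the BTC dataset)
price <- c(10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20)

n <- length(price)
current  <- price[-1]   # observations 2..n
previous <- price[-n]   # observations 1..(n-1)

# Pearson correlation between each value and the one before it
lag1_cor <- cor(current, previous)

# The sample ACF at lag 1 measures the same dependence, but it uses the
# overall mean and variance of the full series, so the two numbers differ
lag1_acf <- acf(price, lag.max = 1, plot = FALSE)$acf[2]

print(c(lag1_cor = lag1_cor, lag1_acf = lag1_acf))
```

A value of `lag1_cor` near 1 on a trending series mostly reflects the trend itself, which is one reason the series is differenced before ARIMA modelling.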

4.3 Regression Analysis

4.3.1 Linear Regression

# Create a linear model of the time series dataset
linear_model <- lm(Price ~ Date, data = BitCoin_df)

# Show the summary of the model and explain the outcome
summary(linear_model)

Call:
lm(formula = Price ~ Date, data = BitCoin_df)

Residuals:
   Min     1Q Median     3Q    Max 
-15114  -7997  -2255   3065  35626 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.211e+05  1.939e+04  -11.40   <2e-16 ***
Date         1.308e+01  1.073e+00   12.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10430 on 105 degrees of freedom
Multiple R-squared:  0.586, Adjusted R-squared:  0.5821 
F-statistic: 148.6 on 1 and 105 DF,  p-value: < 2.2e-16

Regression Analysis

Linear Regression

Model Summary:
- R-squared: 0.586.
- Adjusted R-squared: 0.5821.
- The model explains approximately 58.6% of the variance in Bitcoin prices. This is a substantial proportion, but about 41% of the variance remains unexplained.
- The p-value for the Date coefficient is < 2e-16, indicating a statistically significant linear trend over time.
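As a rough illustration of where these two figures come from, the sketch below recomputes R-squared and adjusted R-squared from the residuals of a fitted line. The data are simulated (an assumption for illustration, not the BTC series), and the names `time_idx`, `r2_manual`, and `adj_r2_manual` are hypothetical.

```r
# Simulated trending series (illustrative, not the BTC data)
set.seed(1)
time_idx <- 1:60
y <- 5 + 2 * time_idx + rnorm(60, sd = 10)

fit <- lm(y ~ time_idx)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# Adjusted R-squared penalises the number of predictors p
n <- length(y)
p <- 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(manual = r2_manual, from_summary = summary(fit)$r.squared))
print(c(manual = adj_r2_manual, from_summary = summary(fit)$adj.r.squared))
```

Applying the same adjustment to the reported values (R-squared = 0.586, n = 107, p = 1) gives 1 - 0.414 * 106 / 105, which is about 0.582 and agrees with the adjusted R-squared in the summary output.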

# Create a plot of the linear model on top of the time series dataset line plot with scatter data points
ggplot(BitCoin_df, aes(x = Date, y = Price)) + 
  geom_point() + 
  geom_line() + 
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Time", y = "Monthly Average BTC Price") +
  theme_minimal() + 
  ggtitle("Linear Regression Model on Time Series Data")


# Perform residual analysis and create a line & scatter plot of the residuals. Explain the outcome
residuals <- resid(linear_model)
plot(BitCoin_df$Date, residuals,xlab="Time", ylab="Residuals", type = "o", col = "blue", main = "Residuals of Linear Model")

Residual Analysis:
- Pattern in Residuals: The residual plot showed patterns, suggesting non-linearity and autocorrelation. Ideally, residuals should be randomly scattered around zero without any discernible pattern.

# Create a histogram plot of the residuals. Explain the outcome
hist(residuals, xlab="Residuals", breaks = 30, col = "lightblue", main = "Histogram of Residuals")

Histogram of Residuals: The histogram indicated that the residuals are not normally distributed. Normality of residuals is an assumption of linear regression.

# Create ACF & PACF plots of residuals. Explain the outcome
acf(residuals)

pacf(residuals)

ACF and PACF of Residuals: The ACF and PACF plots showed significant autocorrelation in the residuals, indicating that the residuals are not independent.
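A numeric complement to reading the ACF plot is a portmanteau test such as Ljung-Box, available as Box.test() in the base stats package. This is an addition for illustration, not a test the analysis above ran. The data below are constructed so that a straight line is fitted to a curved trend, which leaves strongly autocorrelated residuals; the lag of 12 is an arbitrary choice.

```r
# Constructed example: a curved trend fitted with a straight line
time_idx <- 1:100
y <- time_idx^2
fit <- lm(y ~ time_idx)
res <- resid(fit)

# Ljung-Box test: the null hypothesis is no autocorrelation up to the chosen lag
lb <- Box.test(res, lag = 12, type = "Ljung-Box")
print(lb)

# A small p-value means the residuals are not independent
print(lb$p.value < 0.05)
```

On residuals like those of the linear model above, a small p-value would point to the same conclusion as the ACF plot: the independence assumption does not hold.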

# Create QQ plot of residuals. Explain the outcome
qqnorm(residuals)
qqline(residuals, col = "red")

QQ Plot: The QQ plot showed that the residuals deviate from the line, suggesting that they are not normally distributed.

# Perform Shapiro-Wilk test on residuals. Explain the outcome
shapiro.test(residuals)

    Shapiro-Wilk normality test

data:  residuals
W = 0.85983, p-value = 1.215e-08

Shapiro-Wilk Test: The Shapiro-Wilk test returned a p-value of 1.215e-08, so the null hypothesis of normality is rejected and the residuals are not normally distributed.

Autocorrelation: The high correlation between consecutive months (0.9618) suggests strong autocorrelation, which is not captured by the linear model.

Model Appropriateness:

While the linear regression model indicates a statistically significant relationship between Date and Bitcoin Price, several assumptions of linear regression are violated:

The residuals are not normally distributed.
There is significant autocorrelation in the residuals.
The residuals show patterns indicating non-linearity.

Given these violations, the linear regression model may not be the most appropriate for accurately modeling and forecasting Bitcoin prices.

4.3.2 Quadratic Regression

# Create a quadratic model of the time series dataset
BitCoin_df$Date <- as.numeric(BitCoin_df$Date)
quadratic_model <- lm(Price ~ poly(Date, 2), data = BitCoin_df)

# Show the summary of the model and explain the outcome
summary(quadratic_model)

Call:
lm(formula = Price ~ poly(Date, 2), data = BitCoin_df)

Residuals:
   Min     1Q Median     3Q    Max 
-15872  -7420  -1996   2666  36106 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)       14944       1010  14.794   <2e-16 ***
poly(Date, 2)1   127161      10449  12.170   <2e-16 ***
poly(Date, 2)2     8246      10449   0.789    0.432    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10450 on 104 degrees of freedom
Multiple R-squared:  0.5885,    Adjusted R-squared:  0.5806 
F-statistic: 74.36 on 2 and 104 DF,  p-value: < 2.2e-16


# Plot the quadratic regression
ggplot(BitCoin_df, aes(x = Date, y = Price)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), col = "blue") +
  labs(title = "Quadratic Regression on Bitcoin Prices", x = "Date", y = "Close Price")

Quadratic Regression

Model Summary:
- R-squared: 0.5885.
- Adjusted R-squared: 0.5806.
- The R-squared is similar to that of the linear model (0.586), and the adjusted R-squared is slightly lower (0.5806 vs. 0.5821) because the added quadratic term contributes almost nothing.

Model Appropriateness:
- The quadratic term is not significant (p-value = 0.432), so the quadratic model does not significantly improve the fit compared to the linear model.
- Together with the moderate R-squared value, this suggests that the quadratic model does not adequately capture the complexity of the Bitcoin price data.
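The comparison between the linear and quadratic fits can also be framed as a nested-model F-test. The sketch below uses simulated data with a truly linear trend (an assumption for illustration) and the hypothetical names `linear_fit` and `quadratic_fit`; with a single added term, the F-test p-value from anova() coincides with the t-test p-value of the quadratic coefficient.

```r
# Simulated data with a linear trend (illustrative, not the BTC series)
set.seed(2)
time_idx <- 1:80
y <- 10 + 3 * time_idx + rnorm(80, sd = 20)

linear_fit    <- lm(y ~ time_idx)
quadratic_fit <- lm(y ~ poly(time_idx, 2))

# F-test for nested models: does the quadratic term reduce the residual
# sum of squares by more than chance alone would explain?
cmp <- anova(linear_fit, quadratic_fit)
print(cmp)

# p-value of the F-test and of the quadratic coefficient's t-test
p_f <- cmp[["Pr(>F)"]][2]
p_t <- summary(quadratic_fit)$coefficients[3, "Pr(>|t|)"]
print(c(F_test = p_f, t_test = p_t))
```

A large p-value here is read the same way as the 0.432 reported above: the extra curvature term is not supported by the data.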

ARIMA Model Explanation:

Load Libraries: The forecast, tseries, lmtest, and TSA libraries are loaded.
Convert to Time Series: The Price column is converted to a monthly time series object btc_ts (frequency 12).
Handle Missing Values: Any missing values are linearly interpolated using na.approx.
ACF & PACF Plots: The ACF and PACF are plotted with a maximum lag of 24.
ADF Test: The Augmented Dickey-Fuller (ADF) test is used to check for stationarity.
QQ Plot & Shapiro-Wilk Test: A QQ plot and the Shapiro-Wilk test are used to check normality.
Differencing: Because the ADF test does not reject non-stationarity, the series is first-differenced.
Differenced ACF & PACF: The ACF, PACF, and EACF of the differenced series are examined to choose candidate orders.
ARIMA Models: Four candidate ARIMA models with different orders are fitted.
Coefficient Tests: Coefficient significance tests are run on the fitted models.
Model Evaluation: The models are compared using AIC and BIC values.

4.4 ARIMA Model

The complete R code for the ARIMA model section is given below:

# Load necessary libraries
library(lmtest)

# Convert the Bitcoin data frame to a time series object with frequency 12 (monthly data)
btc_ts <- ts(BitCoin$Price, start = c(2015, 1), frequency = 12)

# Check for and handle missing values
if (any(is.na(btc_ts))) {
  btc_ts <- na.approx(btc_ts)  # Linear interpolation to handle missing values
}
# ACF & PACF
par(mfrow=c(1,2))
acf(btc_ts, lag.max = 24)
pacf(btc_ts, lag.max = 24)

library(tseries)
adf.test(btc_ts)

    Augmented Dickey-Fuller Test

data:  btc_ts
Dickey-Fuller = -2.5743, Lag order = 4, p-value = 0.3385
alternative hypothesis: stationary

qqnorm(y=btc_ts,main = "QQ Plot.")
qqline(y=btc_ts,col=2,lwd=1,lty=2)


sw <-  shapiro.test(btc_ts)
sw

    Shapiro-Wilk normality test

data:  btc_ts
W = 0.83358, p-value = 1.258e-09

#Differencing
stationary <- diff(btc_ts)
par(mfrow=c(1,1))
plot(stationary,type='o',ylab="value series",main="Times series plot of the first difference of the Bitcoin price series")


adf.test(stationary)

    Augmented Dickey-Fuller Test

data:  stationary
Dickey-Fuller = -5.1599, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary

#Model Selection
par(mfrow=c(1,2))
acf(stationary, main= "ACF plot of the first difference", lag.max = 30)
pacf(stationary, main= "PACF plot of the first difference", lag.max = 30)

#EACF test
library(TSA)
eacf(stationary, ar.max=3, ma.max=3)
AR/MA
  0 1 2 3
0 o o o o
1 x x o o
2 o x o o
3 x o x o

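#ARIMA(0,1,1)
# Note: stationary is already the first difference of btc_ts, so d = 1 in the order differences the series a second time
model_011= arima(stationary,order=c(0,1,1))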
coeftest(model_011)

z test of coefficients:

     Estimate Std. Error z value  Pr(>|z|)    
ma1 -1.000000   0.028907 -34.594 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#ARIMA(0,1,2)
model_012= arima(stationary,order=c(0,1,2))
coeftest(model_012)

z test of coefficients:

     Estimate Std. Error z value  Pr(>|z|)    
ma1 -0.735344   0.099068 -7.4227 1.148e-13 ***
ma2 -0.264656   0.095704 -2.7654  0.005686 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#ARIMA(1,1,2)
model_112= arima(stationary,order=c(1,1,2))
coeftest(model_112)

z test of coefficients:

    Estimate Std. Error z value Pr(>|z|)   
ar1 -0.11182    0.23669 -0.4724 0.636606   
ma1 -0.64103    0.21104 -3.0374 0.002386 **
ma2 -0.35894    0.20954 -1.7130 0.086706 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#ARIMA(2,1,2)
model_212= arima(stationary,order=c(2,1,2))
coeftest(model_212)

z test of coefficients:

     Estimate Std. Error z value Pr(>|z|)   
ar1  0.292715   0.372707  0.7854 0.432234   
ar2 -0.207007   0.118281 -1.7501 0.080095 . 
ma1 -1.057770   0.378124 -2.7974 0.005151 **
ma2  0.057778   0.377031  0.1532 0.878206   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# Extract AIC and BIC values
aic_values <- c(AIC(model_011), AIC(model_012), AIC(model_112), AIC(model_212))
bic_values <- c(BIC(model_011), BIC(model_012), BIC(model_112), BIC(model_212))

# Create a data frame
model_table <- data.frame(
  Model = c("model_011", "model_012", "model_112", "model_212"),
  AIC = aic_values,
  BIC = bic_values
)

# Print the table
print(model_table)
      Model      AIC      BIC
1 model_011 2070.910 2076.218
2 model_012 2066.781 2074.743
3 model_112 2068.555 2079.171
4 model_212 2068.457 2081.727

model_011 = Arima(stationary, order = c(0, 1, 1))
model_012 = Arima(stationary, order = c(0, 1, 2))

a011 <- accuracy(model_011)
a012 <- accuracy(model_012)



df_models <- data.frame(
rbind(a011, a012)
)
colnames(df_models) <- c("ME", "RMSE", "MAE", "MPE", "MAPE", "MASE", "ACF1")
rownames(df_models) <- c("ARIMA(0,1,1)", "ARIMA(0,1,2)")

kable(df_models, digits = 2, format = "html", row.names = TRUE) %>%
  kable_styling(full_width = F, font_size = 12, position = "center")

	ME	RMSE	MAE	MPE	MAPE	MASE	ACF1
ARIMA(0,1,1)	149.13	4431.72	2507.52	69.50	131.37	0.61	0.19
ARIMA(0,1,2)	119.83	4312.32	2490.54	-133.07	345.53	0.60	-0.03
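The error measures in the accuracy table are simple summaries of the residuals. The sketch below computes five of them by hand on toy numbers chosen for easy arithmetic (they are not model output); the names `actual`, `fitted_vals`, and `err` are hypothetical.

```r
# Toy numbers chosen for easy arithmetic (not model output)
actual      <- c(100, 200, 400)
fitted_vals <- c(110, 190, 380)
err <- actual - fitted_vals             # -10, 10, 20

me   <- mean(err)                       # mean error (bias)
rmse <- sqrt(mean(err^2))               # root mean squared error
mae  <- mean(abs(err))                  # mean absolute error
mpe  <- mean(100 * err / actual)        # mean percentage error
mape <- mean(abs(100 * err / actual))   # mean absolute percentage error

print(round(c(ME = me, RMSE = rmse, MAE = mae, MPE = mpe, MAPE = mape), 2))
```

Because the percentage measures divide by the actual value, MPE and MAPE become very large whenever the series passes close to zero, as a differenced price series does. That is a plausible reason for the MAPE values above 100 in the table, so RMSE and MAE are the more informative columns there.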

res_011 <- rstandard(model_011)
res_012 <- rstandard(model_012)

#Model 011
par(mfrow = c(1, 1))

# Plot 1: Time series plot of standardized residuals
plot(res_011,
     xlab = "Index", ylab = "Residuals",
     main = "Time series plot of standardized residuals")
lines(res_011, col = "blue")
abline(h = 0, col = "red", lty = 2)  # reference line at zero for the residuals

#Histogram
hist(res_011,breaks=20, col="skyblue",
     xlab="Residuals",ylab="Frequency",
     main="Histogram of Residuals")

# ACF & PACF of Residuals
acf_011 <- acf(res_011)

pacf_011 <- pacf(res_011)

#QQ Plot of Residuals
qqnorm(res_011)
qqline(res_011, col="red")

# Shapiro-Wilk Test on Residuals
print(shapiro.test(res_011))

    Shapiro-Wilk normality test

data:  res_011
W = 0.83451, p-value = 1.539e-09

#Model 012
par(mfrow = c(1, 1))

# Plot 1: Time series plot of standardized residuals
plot(res_012,
     xlab = "Index", ylab = "Residuals",
     main = "Time series plot of standardized residuals")
lines(res_012, col = "blue")
abline(h = 0, col = "red", lty = 2)  # reference line at zero for the residuals

#Histogram
hist(res_012,breaks=20, col="skyblue",
     xlab="Residuals",ylab="Frequency",
     main="Histogram of Residuals")

# ACF & PACF of Residuals
acf_012 <- acf(res_012)

pacf_012 <- pacf(res_012)

#QQ Plot of Residuals
qqnorm(res_012)
qqline(res_012, col="red")

# Shapiro-Wilk Test on Residuals
print(shapiro.test(res_012))

    Shapiro-Wilk normality test

data:  res_012
W = 0.85275, p-value = 7.244e-09

Selecting the Best Two Models

The two best candidate models are assessed through an accuracy test and a residual analysis, and the final model is chosen from the outcome of these analyses.

Based on the AIC and BIC values and the significance of the coefficients, we proceed as follows:

Model Selection:

The ARIMA(0,1,2) model has the lowest AIC (2066.781) and the lowest BIC (2074.743), and both of its coefficients are significant.
The ARIMA(0,1,1) model has the second-lowest BIC (2076.218), and its single coefficient is significant.
These two models (ARIMA(0,1,1) and ARIMA(0,1,2)) are assessed further through residual analysis and accuracy tests.

ARIMA Model Analysis

Model Identification

ADF Test:

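The ADF test on the original series (p-value = 0.3385) does not reject the null hypothesis of a unit root, so the series is treated as non-stationary.
After first differencing, the ADF test rejects the unit root (p-value = 0.01), so the differenced series is treated as stationary.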

ACF and PACF:

The ACF, PACF, and EACF of the differenced series were used to shortlist candidate ARIMA(p, d, q) orders.
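In an order (p, d, q), the d term performs the differencing inside the fit, so a model with d = 1 is normally fitted to the undifferenced series. The sketch below illustrates this on a simulated series (an assumption for illustration, not the BTC data) using stats::arima: fitting d = 1 to the raw series and fitting d = 0 to the manually differenced series should give closely matching estimates, whereas passing the differenced series together with d = 1 differences twice.

```r
# Simulated random walk observed with noise (illustrative only)
set.seed(3)
y <- ts(cumsum(rnorm(120)) + rnorm(120, sd = 0.5))

# Option A: let arima() difference the series (d = 1)
fit_a <- stats::arima(y, order = c(0, 1, 1))

# Option B: difference by hand, then fit with d = 0 and no mean term
fit_b <- stats::arima(diff(y), order = c(0, 0, 1), include.mean = FALSE)

# Both describe the same model, so the MA(1) estimates agree closely
ma_a <- coef(fit_a)[["ma1"]]
ma_b <- coef(fit_b)[["ma1"]]
print(c(A = ma_a, B = ma_b))

# Option C: differenced input AND d = 1, i.e. an ARIMA(0,2,1) for y
fit_c <- stats::arima(diff(y), order = c(0, 1, 1))
print(coef(fit_c))
```

An MA coefficient estimated at or very near -1, as in Option C on many series, is a common symptom of over-differencing.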

Model Estimation

ARIMA Models:

ARIMA(0,1,1), ARIMA(0,1,2), ARIMA(1,1,2), and ARIMA(2,1,2) were estimated.
ARIMA(0,1,2) has the lowest AIC (2066.781) and the lowest BIC (2074.743), and both of its coefficients are significant.
ARIMA(0,1,1) has the second-lowest BIC (2076.218), while ARIMA(1,1,2) and ARIMA(2,1,2) contain non-significant coefficients.
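Both criteria are penalised versions of the log-likelihood: AIC adds 2 per estimated parameter, and BIC adds log(n) per parameter. A minimal sketch on a simulated series (an assumption for illustration), using stats::arima and the attributes that logLik() attaches to the fit:

```r
# Simulated random walk (illustrative, not the BTC series)
set.seed(4)
y <- ts(cumsum(rnorm(100)))
fit <- stats::arima(y, order = c(0, 1, 1))

ll <- logLik(fit)
k  <- attr(ll, "df")     # estimated parameters, including the innovation variance
n  <- attr(ll, "nobs")   # observations used in the likelihood

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

print(c(AIC = aic_manual, BIC = bic_manual))
```

Since BIC - AIC = k * (log(n) - 2), the table of AIC and BIC values can be checked by hand: for model_011 the difference is 5.308 with k = 2, which implies n of about 105. That appears consistent with 107 monthly prices less two differences.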

Model Validation

Residual Analysis:

The ARIMA(0,1,2) residuals show less remaining autocorrelation than those of ARIMA(0,1,1) (lag-1 residual autocorrelation of -0.03 versus 0.19 in the accuracy table).
The Shapiro-Wilk test rejects normality for the residuals of both ARIMA(0,1,1) (p-value = 1.539e-09) and ARIMA(0,1,2) (p-value = 7.244e-09), so neither model fully satisfies the normality assumption.

Model Selection

ARIMA(0,1,2) is selected as the final model based on its lower AIC and BIC, significant coefficients, and lower RMSE and MAE in the accuracy test.

4.5 Forecasting

# Fit the model on the original (undifferenced) series and forecast 12 months ahead
bitcoin_fit <- Arima(btc_ts, order = c(0, 1, 1))
forecasted_data <- forecast(bitcoin_fit, h = 12)
kable(forecasted_data, digits = 2, format = "html", row.names = TRUE) %>%
  kable_styling(full_width = F, font_size = 12, position = "center")

	Point Forecast	Lo 80	Hi 80	Lo 95	Hi 95
Dec 2023	38014.11	32449.57	43578.66	29503.88	46524.35
Jan 2024	38014.11	29059.48	46968.75	24319.18	51709.05
Feb 2024	38014.11	26638.40	49389.83	20616.46	55411.76
Mar 2024	38014.11	24648.93	51379.30	17573.83	58454.40
Apr 2024	38014.11	22919.43	53108.80	14928.79	61099.44
May 2024	38014.11	21368.67	54659.56	12557.10	63471.13
Jun 2024	38014.11	19950.55	56077.68	10388.28	65639.95
Jul 2024	38014.11	18635.94	57392.29	8377.76	67650.47
Aug 2024	38014.11	17405.02	58623.21	6495.22	69533.01
Sep 2024	38014.11	16243.58	59784.65	4718.95	71309.28
Oct 2024	38014.11	15141.04	60887.19	3032.76	72995.47
Nov 2024	38014.11	14089.25	61938.98	1424.20	74604.03

12-Month Forecast:

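The forecast table gives a point forecast for each month together with 80% and 95% prediction intervals. The forecast shown comes from an ARIMA(0,1,1) model fitted to the original series btc_ts. Its point forecast is constant at 38014.11 for all 12 months, as expected for an ARIMA(0,1,1) model without drift, while the intervals widen as the horizon grows. A plot of the forecasted values provides a visual representation of the expected price level and its uncertainty.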
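The intervals in a forecast table of this kind are built from the point forecast and its standard error using normal quantiles. The sketch below shows the construction with stats::arima and predict() on a simulated series (an assumption for illustration, not the BTC data); the column names in `intervals` are hypothetical.

```r
# Simulated monthly series (illustrative, not the BTC data)
set.seed(5)
y <- ts(100 + cumsum(rnorm(107)), start = c(2015, 1), frequency = 12)

fit <- stats::arima(y, order = c(0, 1, 1))
fc  <- predict(fit, n.ahead = 12)   # $pred = point forecasts, $se = standard errors

z80 <- qnorm(0.90)    # about 1.28
z95 <- qnorm(0.975)   # about 1.96

intervals <- data.frame(
  point = as.numeric(fc$pred),
  lo80  = as.numeric(fc$pred - z80 * fc$se),
  hi80  = as.numeric(fc$pred + z80 * fc$se),
  lo95  = as.numeric(fc$pred - z95 * fc$se),
  hi95  = as.numeric(fc$pred + z95 * fc$se)
)
print(round(intervals, 2))
```

The same arithmetic can be checked against the table above: for December 2023 the 95% half-width is (46524.35 - 29503.88) / 2, about 8510, and the 80% half-width is about 5565. Both correspond to a standard error of roughly 4342 multiplied by 1.96 and 1.28 respectively.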

4.6 Conclusion

a. Performance Comparison:

Linear Regression:

Pros: Simple to implement and interpret.

Cons: Limited by linearity assumption, residuals show non-normality and autocorrelation.

Quadratic Regression:

Pros: Can model slight curvature in the trend.

Cons: The quadratic term was not significant, similar performance to linear regression.

ARIMA:

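Pros: Models the stochastic trend (through differencing) and the autocorrelation in the data, which makes it well suited to time series. The non-seasonal models fitted here do not include a seasonal component.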

Cons: More complex to implement and interpret, requires careful model identification and validation.

b. Final Model Selection:

The ARIMA(0,1,2) model was chosen as the final model because it has the lowest AIC and BIC among the candidate models, significant coefficients, and lower RMSE and MAE than ARIMA(0,1,1) in the accuracy test. Unlike the regression models, it accounts for the strong autocorrelation in the Bitcoin price data.

c. Final Remarks

An ARIMA model is appropriate for forecasting Bitcoin prices because it accounts for the stochastic trend and the autocorrelation in the data. The 12-month forecast gives an expected price level together with prediction intervals, which supports decisions that take the uncertainty of future price movements into account.

The Shapiro-Wilk test shows that the residuals of both shortlisted ARIMA models are not normally distributed, so the prediction intervals should be interpreted with caution.