1. Introduction

1.1 Background

Time series analysis is a critical tool for understanding and forecasting financial markets. This project focuses on Bitcoin (BTC), the leading cryptocurrency, to demonstrate a comprehensive approach to time series modeling. Cryptocurrencies exhibit unique characteristics including high volatility, 24/7 trading, and strong volatility clustering—making them ideal candidates for advanced time series techniques.

1.2 Objectives

This analysis aims to:

Explore and visualize the historical price and volume data
Test for stationarity using diagnostic plots and formal statistical tests
Build regression models to capture return dynamics
Model volatility using GARCH-family models
Generate forecasts for future returns with confidence intervals

1.3 Methodology Overview

Our approach follows a systematic pipeline:

Step	Description
Data Preparation	Load, clean, and validate the dataset
Stationarity Analysis	Confirm non-stationarity and transform to stationary series
Mean Modeling	Fit OLS and ARIMAX models for log returns
Volatility Modeling	Test for ARCH effects and fit eGARCH model
Forecasting	Generate 5-day ahead predictions with uncertainty quantification

# Load required libraries
library(dplyr)
library(ggplot2)
library(gridExtra)
library(forecast)
library(lmtest)
library(tseries)
library(zoo)
library(lubridate)
library(knitr)
library(rugarch)
library(FinTS)

2. Preliminary Analysis

2.1 Data Loading and Cleaning

We begin by loading the Bitcoin price data and performing essential data cleaning operations including date parsing, type conversion, and handling missing values.

# Load and clean data function
load_and_clean_data <- function(file_path) {
  # Read CSV file
  data <- read.csv(file_path, stringsAsFactors = FALSE)
  
  # Remove second row if it contains currency metadata
  if (nrow(data) > 1 && grepl("BTC-USD|USD", data[1, 2], ignore.case = TRUE)) {
    data <- data[-1, ]
  }
  
  # Parse Date column
  data$Date <- as.Date(data$Date, format = "%Y-%m-%d")
  
  # Convert numeric columns
  data$Close <- as.numeric(data$Close)
  data$Open <- as.numeric(data$Open)
  data$High <- as.numeric(data$High)
  data$Low <- as.numeric(data$Low)
  data$Volume <- as.numeric(data$Volume)
  
  # Sort by date (ascending)
  data <- data %>% arrange(Date)
  
  # Remove duplicate dates
  data <- data %>% distinct(Date, .keep_all = TRUE)
  
  # Remove rows with missing values
  data <- data %>% filter(!is.na(Date), !is.na(Close), !is.na(Volume))
  
  return(data)
}

# Function to check trading days
check_trading_days <- function(data) {
  data$Weekday <- wday(data$Date, week_start = 1)
  has_weekends <- any(data$Weekday %in% c(6, 7))
  total_days <- nrow(data)
  weekend_count <- sum(data$Weekday %in% c(6, 7))
  
  list(
    has_weekends = has_weekends,
    total_days = total_days,
    weekend_count = weekend_count
  )
}

# Load Bitcoin data
csv_file <- "archive/bitcoin.csv"
data <- load_and_clean_data(csv_file)

# Display dataset information
cat("Dataset Summary:\n")

## Dataset Summary:

cat("- Date Range:", as.character(min(data$Date)), "to", as.character(max(data$Date)), "\n")

## - Date Range: 2014-09-17 to 2026-01-05

cat("- Total Observations:", nrow(data), "\n")

## - Total Observations: 4129

cat("- Trading Days Info:\n")

## - Trading Days Info:

check_trading_days(data)

## $has_weekends
## [1] TRUE
## 
## $total_days
## [1] 4129
## 
## $weekend_count
## [1] 1180

2.2 Exploratory Data Analysis

2.2.1 Price and Volume Trends

# Close Price over time
p1 <- ggplot(data, aes(x = Date, y = Close)) +
  geom_line(color = "steelblue", linewidth = 0.6) +
  labs(title = "Bitcoin Close Price Over Time",
       x = "Date", 
       y = "Close Price (USD)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Volume over time
p2 <- ggplot(data, aes(x = Date, y = Volume)) +
  geom_line(color = "darkgreen", linewidth = 0.6) +
  labs(title = "Bitcoin Trading Volume Over Time",
       x = "Date", 
       y = "Volume") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p1, p2, ncol = 1)

2.2.2 Summary Statistics

# Calculate summary statistics
summary_stats <- data %>%
  summarise(
    Mean_Price = mean(Close, na.rm = TRUE),
    Median_Price = median(Close, na.rm = TRUE),
    SD_Price = sd(Close, na.rm = TRUE),
    Min_Price = min(Close, na.rm = TRUE),
    Max_Price = max(Close, na.rm = TRUE),
    Mean_Volume = mean(Volume, na.rm = TRUE)
  )

kable(summary_stats, caption = "Summary Statistics for Bitcoin", digits = 2)

Summary Statistics for Bitcoin
Mean_Price	Median_Price	SD_Price	Min_Price	Max_Price	Mean_Volume
26925.96	10779.9	31709.9	178.1	124752.5	21643259806

2.3 Stationarity Analysis

2.3.1 Why Stationarity Matters

Time series models assume stationarity—constant mean and variance over time. Financial price levels typically violate this assumption due to trends and changing volatility. We must transform the data to achieve stationarity before modeling.

2.3.2 Visual Diagnostics

# Create log-transformed series
data$log_Close <- log(data$Close)
data$log_Volume <- log(data$Volume)

# Calculate log returns
data$returns <- c(NA, diff(data$log_Close))

# Rolling statistics window
window <- min(60, floor(nrow(data) / 10))

# Calculate rolling mean and SD for Close price
data$Close_rolling_mean <- rollmean(data$Close, k = window, fill = NA, align = "right")
data$Close_rolling_sd <- rollapply(data$Close, width = window, FUN = sd, fill = NA, align = "right")

# Plot 1: Close Price with rolling statistics
p1 <- ggplot(data, aes(x = Date)) +
  geom_line(aes(y = Close), color = "steelblue", linewidth = 0.6) +
  geom_line(aes(y = Close_rolling_mean), color = "red", linewidth = 1, linetype = "dashed", na.rm = TRUE) +
  geom_ribbon(aes(ymin = Close_rolling_mean - Close_rolling_sd, 
                  ymax = Close_rolling_mean + Close_rolling_sd), 
              alpha = 0.2, fill = "red", na.rm = TRUE) +
  labs(title = "Close Price with Rolling Mean ± SD (60-day window)",
       subtitle = "Non-stationary: trending mean and changing variance",
       x = "Date", y = "Close Price") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Plot 2: ACF of Close Price
acf_close <- acf(data$Close, plot = FALSE, na.action = na.pass)
acf_df_close <- data.frame(Lag = acf_close$lag[,1,1], ACF = acf_close$acf[,1,1]) %>% 
  filter(Lag > 0)
p2 <- ggplot(acf_df_close, aes(x = Lag, y = ACF)) +
  geom_bar(stat = "identity", fill = "steelblue", width = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(yintercept = c(-1.96, 1.96) / sqrt(length(data$Close)), 
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "ACF of Close Price",
       subtitle = "Slow decay → non-stationarity",
       x = "Lag", y = "ACF") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p1, p2, ncol = 2)

Diagnosis: Close price shows a clear trend and ACF values near 1 with slow decay → non-stationary.

Solution: Use log returns: \(r_t = \log(C_t) - \log(C_{t-1})\)

2.3.3 Log Returns: The Stationary Transformation

ggplot(data, aes(x = Date, y = returns)) +
  geom_line(color = "steelblue", linewidth = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed", alpha = 0.7) +
  labs(title = "Log Returns: r(t) = log(Cₜ) - log(Cₜ₋₁)",
       subtitle = "Mean-reverting around zero with volatility clustering",
       x = "Date", 
       y = "Log Returns") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Remove NA values
returns_clean <- na.omit(data$returns)

# ADF test 
# H0: Unit root (non-stationary)
adf.test(returns_clean)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  returns_clean
## Dickey-Fuller = -14.595, Lag order = 16, p-value = 0.01
## alternative hypothesis: stationary

# KPSS test 
# H0: Stationary
kpss.test(returns_clean, null = "Trend")

## 
##  KPSS Test for Trend Stationarity
## 
## data:  returns_clean
## KPSS Trend = 0.077407, Truncation lag parameter = 10, p-value = 0.1

Conclusion: Log returns are stationary—ADF rejects the unit root hypothesis, and KPSS fails to reject stationarity. Visually, they are mean-reverting around ~0 with changing variance over time, i.e., much closer to stationary (in mean) and exhibiting volatility clustering. We can proceed with modeling.

3. Model Creation

3.1 Data Preparation for Modeling

We create the dependent variable (log returns) and relevant predictors:

Lagged return (\(r_{t-1}\)): captures momentum/mean-reversion
Volume change (\(\Delta\log(Vol_t)\)): captures trading activity shifts
Intraday range (\(\log(H_t/L_t)\)): volatility proxy

prepare_regression_data <- function(data) {
  # Dependent variable: log returns
  data$log_returns <- c(NA, diff(log(data$Close)))
  
  # Regressors
  data$lag_return <- lag(data$log_returns, 1)
  data$intraday_range <- log(data$High / data$Low)
  data$log_Volume <- log(data$Volume)
  data$delta_log_Volume <- c(NA, diff(data$log_Volume))
  
  # Remove incomplete observations
  data <- data %>% filter(!is.na(log_returns), 
                          !is.na(lag_return),
                          !is.na(delta_log_Volume),
                          !is.na(intraday_range))
  return(data)
}

data_reg <- prepare_regression_data(data)
cat("Observations for modeling:", nrow(data_reg), "\n")

## Observations for modeling: 4127

3.2 Model 1: OLS Regression

3.2.1 Model Specification

\[r_t = \beta_0 + \beta_1 r_{t-1} + \beta_2 \Delta\log(Vol_t) + \beta_3 \log(H_t/L_t) + \varepsilon_t\]

model1 <- lm(log_returns ~ lag_return + delta_log_Volume + intraday_range, 
             data = data_reg)
summary(model1)

## 
## Call:
## lm(formula = log_returns ~ lag_return + delta_log_Volume + intraday_range, 
##     data = data_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41120 -0.01483 -0.00204  0.01393  0.25323 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.006891   0.000890   7.743 1.22e-14 ***
## lag_return       -0.035920   0.015536  -2.312   0.0208 *  
## delta_log_Volume  0.007332   0.001733   4.231 2.37e-05 ***
## intraday_range   -0.128431   0.016223  -7.916 3.12e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03511 on 4123 degrees of freedom
## Multiple R-squared:  0.01615,    Adjusted R-squared:  0.01543 
## F-statistic: 22.56 on 3 and 4123 DF,  p-value: 1.752e-14

You found small but real mean effects: slight mean reversion, volume-associated drift, and a strong negative relationship with intraday volatility.

But the model is not a strong return predictor (R² ~ 1.6%).

Residual behavior likely indicates heteroskedasticity and heavy tails, so pairing this with ARMA errors and/or GARCH volatility is the natural next step.

3.2.2 Residual Diagnostics

res1 <- residuals(model1)

# ACF of residuals
acf_res <- acf(res1, lag.max = 40, plot = FALSE, na.action = na.pass)
acf_df <- data.frame(Lag = acf_res$lag[,1,1], ACF = acf_res$acf[,1,1])

p_acf <- ggplot(acf_df, aes(x = Lag, y = ACF)) +
  geom_bar(stat = "identity", fill = "steelblue", width = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(yintercept = c(-1.96, 1.96) / sqrt(length(res1)), 
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "ACF of OLS Residuals",
       x = "Lag", y = "ACF") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# PACF of residuals
pacf_res <- pacf(res1, plot = FALSE, na.action = na.pass)
pacf_df <- data.frame(Lag = as.numeric(pacf_res$lag), PACF = as.numeric(pacf_res$acf))

p_pacf <- ggplot(pacf_df, aes(x = Lag, y = PACF)) +
  geom_bar(stat = "identity", fill = "darkgreen", width = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(yintercept = c(-1.96, 1.96) / sqrt(length(res1)), 
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "PACF of OLS Residuals",
       x = "Lag", y = "PACF") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p_acf, p_pacf, ncol = 2)

# Ljung-Box test
Box.test(res1, lag = 10, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  res1
## X-squared = 23.231, df = 10, p-value = 0.009927

Interpretation:

Residual ACF is mostly near zero, but a small number of lags exceed the 95% bounds, indicating mild residual autocorrelation.

Given the large sample size, even small correlations can be statistically significant; Ljung–Box confirms residuals are not perfectly white noise.

For forecasting: upgrade to regression with ARMA errors / ARIMAX, because the residuals aren’t fully independent.

3.3 Model 2: ARIMAX (Regression with ARMA Errors)

We model the regression error as an ARMA process. Since cryptocurrencies trade 7 days/week, we also test a seasonal component with period \(s=7\).

y <- data_reg$log_returns
xreg <- cbind(data_reg$lag_return, data_reg$delta_log_Volume, data_reg$intraday_range)

# Non-seasonal ARIMAX
fit_ns <- auto.arima(y, xreg = xreg, seasonal = FALSE)

# Seasonal ARIMAX (weekly seasonality)
y7 <- ts(y, frequency = 7)
fit_s <- auto.arima(y7, xreg = xreg, seasonal = TRUE)

# Model comparison
comparison <- data.frame(
  Model = c("Non-Seasonal ARIMAX", "Seasonal ARIMAX (s=7)"),
  AIC = c(AIC(fit_ns), AIC(fit_s)),
  BIC = c(BIC(fit_ns), BIC(fit_s))
)
kable(comparison, digits = 2, caption = "Model Selection Criteria")

Model Selection Criteria
Model	AIC	BIC
Non-Seasonal ARIMAX	-15956.29	-15899.36
Seasonal ARIMAX (s=7)	-15964.46	-15901.21

checkresiduals(fit_s)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(2,0,2)(0,0,1)[7] errors
## Q* = 11.901, df = 9, p-value = 0.2189
## 
## Model df: 5.   Total lags used: 14

Selected Model: Seasonal ARIMAX with ARIMA\((2,0,2)(0,0,1)_{7}\) errors achieves lower BIC and passes the Ljung-Box test.

3.4 Volatility Modeling (eGARCH)

3.4.1 Testing for ARCH Effects

Financial time series often exhibit volatility clustering—large returns tend to be followed by large returns. We test for ARCH effects in the ARIMAX residuals.

res2 <- residuals(fit_s)
res2 <- res2[!is.na(res2)]

# Visual check: ACF of squared residuals (volatility clustering)
acf_sq <- acf(res2^2, lag.max = 40, plot = FALSE, na.action = na.pass)
acf_sq_df <- data.frame(Lag = as.numeric(acf_sq$lag[,1,1]),
                        ACF = as.numeric(acf_sq$acf[,1,1]))

ggplot(acf_sq_df, aes(x = Lag, y = ACF)) +
  geom_bar(stat = "identity", fill = "steelblue", width = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(yintercept = c(-1.96, 1.96) / sqrt(length(res2)), 
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "ACF of Squared Residuals",
       subtitle = "Significant spikes indicate volatility clustering / ARCH effects",
       x = "Lag", y = "ACF") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Formal ARCH-LM tests
# H0: No ARCH effects (homoskedasticity) # 

ArchTest(res2, lags = 7)

## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  res2
## Chi-squared = 197.26, df = 7, p-value < 2.2e-16

ArchTest(res2, lags = 14)

## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  res2
## Chi-squared = 209.63, df = 14, p-value < 2.2e-16

ArchTest(res2, lags = 21)

## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  res2
## Chi-squared = 220.49, df = 21, p-value < 2.2e-16

Conclusion: Strong ARCH effects detected (p < 0.05). We proceed with GARCH modeling.

3.4.2 Fitting the eGARCH Model

We use an eGARCH(1,1) model with skewed Student-t innovations to capture:

Asymmetric volatility response (leverage effect)
Heavy tails in the return distribution

# eGARCH specification
spec <- ugarchspec(
  variance.model = list(model = "eGARCH", garchOrder = c(1,1)),
  mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
  distribution.model = "sstd"
)

# Fit on ARIMAX residuals
fit_vol <- ugarchfit(spec, data = res2, solver = "hybrid")
show(fit_vol)

## 
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
## 
## Conditional Variance Dynamics    
## -----------------------------------
## GARCH Model  : eGARCH(1,1)
## Mean Model   : ARFIMA(0,0,0)
## Distribution : sstd 
## 
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error   t value Pr(>|t|)
## omega  -0.098688    0.032346  -3.05099 0.002281
## alpha1 -0.008444    0.014671  -0.57557 0.564906
## beta1   0.984950    0.005059 194.67468 0.000000
## gamma1  0.226942    0.039111   5.80253 0.000000
## skew    1.139943    0.017382  65.58249 0.000000
## shape   2.879734    0.154490  18.64026 0.000000
## 
## Robust Standard Errors:
##         Estimate  Std. Error   t value Pr(>|t|)
## omega  -0.098688    0.059388  -1.66176 0.096560
## alpha1 -0.008444    0.023182  -0.36426 0.715665
## beta1   0.984950    0.009557 103.05761 0.000000
## gamma1  0.226942    0.078354   2.89635 0.003775
## skew    1.139943    0.015684  72.68141 0.000000
## shape   2.879734    0.172840  16.66129 0.000000
## 
## LogLikelihood : 8856.944 
## 
## Information Criteria
## ------------------------------------
##                     
## Akaike       -4.2893
## Bayes        -4.2801
## Shibata      -4.2893
## Hannan-Quinn -4.2860
## 
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                         statistic   p-value
## Lag[1]                      12.10 0.0005051
## Lag[2*(p+q)+(p+q)-1][2]     12.78 0.0003522
## Lag[4*(p+q)+(p+q)-1][5]     14.57 0.0005797
## d.o.f=0
## H0 : No serial correlation
## 
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      2.006  0.1566
## Lag[2*(p+q)+(p+q)-1][5]     2.820  0.4408
## Lag[4*(p+q)+(p+q)-1][9]     3.278  0.7122
## d.o.f=2
## 
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]     1.025 0.500 2.000  0.3113
## ARCH Lag[5]     1.190 1.440 1.667  0.6776
## ARCH Lag[7]     1.311 2.315 1.543  0.8584
## 
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  3.306
## Individual Statistics:             
## omega  0.1661
## alpha1 1.0930
## beta1  0.1642
## gamma1 1.4742
## skew   0.4420
## shape  0.1234
## 
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:          1.49 1.68 2.12
## Individual Statistic:     0.35 0.47 0.75
## 
## Sign Bias Test
## ------------------------------------
##                    t-value    prob sig
## Sign Bias            2.116 0.03442  **
## Negative Sign Bias   1.294 0.19565    
## Positive Sign Bias   1.291 0.19682    
## Joint Effect         4.795 0.18740    
## 
## 
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     23.06      0.23468
## 2    30     38.04      0.12140
## 3    40     49.12      0.12853
## 4    50     63.20      0.08361
## 
## 
## Elapsed time : 0.345021

Interpretation:

The eGARCH(1,1) model with skewed Student-\(t\) innovations fits volatility well: volatility is highly persistent \((\beta_1 = 0.985,\ p < 0.001)\) and there is sign asymmetry/leverage \((\gamma_1 = 0.227,\ \text{robust } p = 0.0038)\). The shock-magnitude term is not significant \((\alpha_1,\ p > 0.7)\). The skewed-\(t\) parameters are strongly significant (skew \(= 1.14\), shape \(= 2.88\)), confirming skewness and fat tails.

There is no remaining ARCH/volatility clustering (Ljung–Box on squared residuals \(p = 0.16\text{–}0.71\); ARCH–LM \(p = 0.31\text{–}0.86\)), but standardized residuals are still autocorrelated (Ljung–Box \(p < 0.001\)), suggesting leftover mean dynamics not captured in this specific fit. Parameter stability is a concern: the Nyblom joint test rejects stability \((3.306 > 2.12)\), especially for \(\alpha_1\) and \(\gamma_1\). Sign-bias is mildly significant \((p = 0.034)\), while overall distributional fit is acceptable (Pearson GoF \(p \approx 0.08\text{–}0.23\)).

4. Forecasting

4.1 Generating 5-Day Ahead Forecasts

We combine the mean model (seasonal ARIMAX) and volatility model (eGARCH) to generate forecasts with proper uncertainty quantification.

# Forecast horizon
h <- 5

# Future regressors (use last known values as proxy)
last_values <- tail(data_reg, 1)
xreg_future <- matrix(
  rep(c(last_values$lag_return, last_values$delta_log_Volume, last_values$intraday_range), h),
  nrow = h, ncol = 3, byrow = TRUE
)

# Forecast mean returns
fc_mean <- forecast(fit_s, h = h, xreg = xreg_future)
mu_hat <- as.numeric(fc_mean$mean)

# Forecast volatility
res_all <- as.numeric(residuals(fit_s))
res_all <- res_all[is.finite(res_all)]

spec <- ugarchspec(
  variance.model = list(model = "eGARCH", garchOrder = c(1,1)),
  mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
  distribution.model = "sstd"
)
fit_vol <- ugarchfit(spec, data = res_all, solver = "hybrid")
fc_vol <- ugarchforecast(fit_vol, n.ahead = h)
sigma_hat <- as.numeric(sigma(fc_vol))

4.2 Forecast Results

# Create forecast table
last_date <- max(data_reg$Date)
forecast_dates <- last_date + 1:h

forecast_results <- data.frame(
  Date = forecast_dates,
  Forecasted_Return = mu_hat,
  Forecasted_Volatility = sigma_hat,
  Lower_95 = mu_hat - 1.96 * sigma_hat,
  Upper_95 = mu_hat + 1.96 * sigma_hat
)

kable(forecast_results, digits = 6, caption = "Return Forecasts with 95% Confidence Intervals")

Return Forecasts with 95% Confidence Intervals
Date	Forecasted_Return	Forecasted_Volatility	Lower_95	Upper_95
2026-01-06	0.014201	0.023308	-0.031482	0.059885
2026-01-07	0.002576	0.023477	-0.043440	0.048591
2026-01-08	0.002770	0.023645	-0.043574	0.049114
2026-01-09	0.002733	0.023811	-0.043938	0.049403
2026-01-10	0.002592	0.023976	-0.044401	0.049586

4.3 Forecast Visualization

# Plot 1: Return forecasts with confidence intervals
p1 <- ggplot(forecast_results, aes(x = Date)) +
  geom_ribbon(aes(ymin = Lower_95, ymax = Upper_95), alpha = 0.3, fill = "steelblue") +
  geom_line(aes(y = Forecasted_Return), color = "steelblue", linewidth = 1.2) +
  geom_point(aes(y = Forecasted_Return), color = "steelblue", size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "5-Day Log Return Forecast",
       subtitle = "Shaded region: 95% confidence interval",
       x = "Date",
       y = "Forecasted Log Return") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Plot 2: Volatility forecast
p2 <- ggplot(forecast_results, aes(x = Date, y = Forecasted_Volatility)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +
  geom_point(color = "darkgreen", size = 3) +
  geom_area(alpha = 0.3, fill = "darkgreen") +
  labs(title = "5-Day Volatility Forecast",
       subtitle = "Forecasted standard deviation of returns",
       x = "Date",
       y = "Forecasted Volatility (σ)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p1, p2, ncol = 1)

4.4 Historical Context

# Show recent returns with forecast
recent_data <- tail(data_reg, 30)
recent_returns <- data.frame(
  Date = recent_data$Date,
  Return = recent_data$log_returns,
  Type = "Historical"
)

forecast_data <- data.frame(
  Date = forecast_results$Date,
  Return = forecast_results$Forecasted_Return,
  Type = "Forecast"
)

combined <- rbind(recent_returns, forecast_data)

ggplot(combined, aes(x = Date, y = Return, color = Type)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  geom_ribbon(data = forecast_results, 
              aes(x = Date, ymin = Lower_95, ymax = Upper_95), 
              inherit.aes = FALSE, alpha = 0.2, fill = "coral") +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c("Historical" = "steelblue", "Forecast" = "coral")) +
  labs(title = "Recent History and 5-Day Forecast",
       subtitle = "With 95% confidence interval for forecasts",
       x = "Date",
       y = "Log Return") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "bottom")

5. Summary

Key Findings

Aspect	Result
Data	4129 observations covering 2014-09-17 to 2026-01-05
Stationarity	Price levels non-stationary; log returns are stationary (ADF/KPSS confirmed)
Mean Model	Seasonal ARIMAX(2,0,2)(1,0,0)₇ with external regressors
Volatility Model	eGARCH(1,1) with skewed Student-t distribution
ARCH Effects	Strong evidence of volatility clustering (p < 0.001)

Methodology Summary

Preliminary Analysis: Loaded and cleaned data, confirmed Bitcoin trades 7 days/week, identified non-stationarity in price levels
Model Creation:
- Started with OLS regression on log returns
- Upgraded to seasonal ARIMAX to capture residual autocorrelation
- Added eGARCH to model time-varying volatility
Forecasting: Generated 5-day ahead predictions combining mean and volatility forecasts with 95% confidence intervals

Limitations and Future Work

Parameter stability (regimes): Stability tests indicated time-varying parameters; consider rolling/expanding re-estimation or regime-switching volatility models.
Regressor forecasts: Future regressors were held constant (last observation carried forward); forecasting regressors or using scenario-based inputs could improve multi-step forecasts.
Model extensions: Compare asymmetric variants (e.g., GJR-GARCH, APARCH) and alternative innovation distributions.
External factors: Add macro/sentiment/on-chain variables to capture exogenous volatility drivers.
Out-of-sample evaluation: Use rolling-window backtests and risk-focused metrics (VaR/ES backtests, QLIKE) to assess forecast robustness.

Time Series Analysis: Bitcoin Price Forecasting

Nhi Dinh

1. Introduction

1.1 Background

1.2 Objectives

1.3 Methodology Overview

2. Preliminary Analysis

2.1 Data Loading and Cleaning

2.2 Exploratory Data Analysis

2.2.1 Price and Volume Trends

2.2.2 Summary Statistics

2.3 Stationarity Analysis

2.3.1 Why Stationarity Matters

2.3.2 Visual Diagnostics

2.3.3 Log Returns: The Stationary Transformation

3. Model Creation

3.1 Data Preparation for Modeling

3.2 Model 1: OLS Regression

3.2.1 Model Specification

3.2.2 Residual Diagnostics

3.3 Model 2: ARIMAX (Regression with ARMA Errors)

3.4 Volatility Modeling (eGARCH)

3.4.1 Testing for ARCH Effects

3.4.2 Fitting the eGARCH Model

4. Forecasting

4.1 Generating 5-Day Ahead Forecasts

4.2 Forecast Results

4.3 Forecast Visualization

4.4 Historical Context

5. Summary

Key Findings

Methodology Summary

Limitations and Future Work