1. Introduction

1.1 Background

Time series analysis is a critical tool for understanding and forecasting financial markets. This project focuses on Bitcoin (BTC), the leading cryptocurrency, to demonstrate a comprehensive approach to time series modeling. Cryptocurrencies exhibit unique characteristics including high volatility, 24/7 trading, and strong volatility clustering—making them ideal candidates for advanced time series techniques.

1.2 Objectives

This analysis aims to:

  1. Explore and visualize the historical price and volume data
  2. Test for stationarity using diagnostic plots and formal statistical tests
  3. Build regression models to capture return dynamics
  4. Model volatility using GARCH-family models
  5. Generate forecasts for future returns with confidence intervals

1.3 Methodology Overview

Our approach follows a systematic pipeline:

Step Description
Data Preparation Load, clean, and validate the dataset
Stationarity Analysis Confirm non-stationarity and transform to stationary series
Mean Modeling Fit OLS and ARIMAX models for log returns
Volatility Modeling Test for ARCH effects and fit eGARCH model
Forecasting Generate 5-day ahead predictions with uncertainty quantification
# Load required libraries
library(dplyr)
library(ggplot2)
library(gridExtra)
library(forecast)
library(lmtest)
library(tseries)
library(zoo)
library(lubridate)
library(knitr)
library(rugarch)
library(FinTS)

2. Preliminary Analysis

2.1 Data Loading and Cleaning

We begin by loading the Bitcoin price data and performing essential data cleaning operations including date parsing, type conversion, and handling missing values.

# Load and clean data function
load_and_clean_data <- function(file_path) {
  # Read CSV file
  data <- read.csv(file_path, stringsAsFactors = FALSE)
  
  # Remove second row if it contains currency metadata
  if (nrow(data) > 1 && grepl("BTC-USD|USD", data[1, 2], ignore.case = TRUE)) {
    data <- data[-1, ]
  }
  
  # Parse Date column
  data$Date <- as.Date(data$Date, format = "%Y-%m-%d")
  
  # Convert numeric columns
  data$Close <- as.numeric(data$Close)
  data$Open <- as.numeric(data$Open)
  data$High <- as.numeric(data$High)
  data$Low <- as.numeric(data$Low)
  data$Volume <- as.numeric(data$Volume)
  
  # Sort by date (ascending)
  data <- data %>% arrange(Date)
  
  # Remove duplicate dates
  data <- data %>% distinct(Date, .keep_all = TRUE)
  
  # Remove rows with missing values
  data <- data %>% filter(!is.na(Date), !is.na(Close), !is.na(Volume))
  
  return(data)
}

# Function to check trading days
check_trading_days <- function(data) {
  data$Weekday <- wday(data$Date, week_start = 1)
  has_weekends <- any(data$Weekday %in% c(6, 7))
  total_days <- nrow(data)
  weekend_count <- sum(data$Weekday %in% c(6, 7))
  
  list(
    has_weekends = has_weekends,
    total_days = total_days,
    weekend_count = weekend_count
  )
}
# Load Bitcoin data
csv_file <- "archive/bitcoin.csv"
data <- load_and_clean_data(csv_file)

# Display dataset information
cat("Dataset Summary:\n")
## Dataset Summary:
cat("- Date Range:", as.character(min(data$Date)), "to", as.character(max(data$Date)), "\n")
## - Date Range: 2014-09-17 to 2026-01-05
cat("- Total Observations:", nrow(data), "\n")
## - Total Observations: 4129
cat("- Trading Days Info:\n")
## - Trading Days Info:
check_trading_days(data)
## $has_weekends
## [1] TRUE
## 
## $total_days
## [1] 4129
## 
## $weekend_count
## [1] 1180

2.2 Exploratory Data Analysis

2.2.2 Summary Statistics

# Calculate summary statistics
summary_stats <- data %>%
  summarise(
    Mean_Price = mean(Close, na.rm = TRUE),
    Median_Price = median(Close, na.rm = TRUE),
    SD_Price = sd(Close, na.rm = TRUE),
    Min_Price = min(Close, na.rm = TRUE),
    Max_Price = max(Close, na.rm = TRUE),
    Mean_Volume = mean(Volume, na.rm = TRUE)
  )

kable(summary_stats, caption = "Summary Statistics for Bitcoin", digits = 2)
Summary Statistics for Bitcoin
Mean_Price Median_Price SD_Price Min_Price Max_Price Mean_Volume
26925.96 10779.9 31709.9 178.1 124752.5 21643259806

2.3 Stationarity Analysis

2.3.1 Why Stationarity Matters

Time series models assume stationarity—constant mean and variance over time. Financial price levels typically violate this assumption due to trends and changing volatility. We must transform the data to achieve stationarity before modeling.

2.3.2 Visual Diagnostics

# Create log-transformed series
data$log_Close <- log(data$Close)
data$log_Volume <- log(data$Volume)

# Calculate log returns
data$returns <- c(NA, diff(data$log_Close))

# Rolling statistics window
window <- min(60, floor(nrow(data) / 10))

# Calculate rolling mean and SD for Close price
data$Close_rolling_mean <- rollmean(data$Close, k = window, fill = NA, align = "right")
data$Close_rolling_sd <- rollapply(data$Close, width = window, FUN = sd, fill = NA, align = "right")
# Plot 1: Close Price with rolling statistics
p1 <- ggplot(data, aes(x = Date)) +
  geom_line(aes(y = Close), color = "steelblue", linewidth = 0.6) +
  geom_line(aes(y = Close_rolling_mean), color = "red", linewidth = 1, linetype = "dashed", na.rm = TRUE) +
  geom_ribbon(aes(ymin = Close_rolling_mean - Close_rolling_sd, 
                  ymax = Close_rolling_mean + Close_rolling_sd), 
              alpha = 0.2, fill = "red", na.rm = TRUE) +
  labs(title = "Close Price with Rolling Mean ± SD (60-day window)",
       subtitle = "Non-stationary: trending mean and changing variance",
       x = "Date", y = "Close Price") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Plot 2: ACF of Close Price
acf_close <- acf(data$Close, plot = FALSE, na.action = na.pass)
acf_df_close <- data.frame(Lag = acf_close$lag[,1,1], ACF = acf_close$acf[,1,1]) %>% 
  filter(Lag > 0)
p2 <- ggplot(acf_df_close, aes(x = Lag, y = ACF)) +
  geom_bar(stat = "identity", fill = "steelblue", width = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(yintercept = c(-1.96, 1.96) / sqrt(length(data$Close)), 
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "ACF of Close Price",
       subtitle = "Slow decay → non-stationarity",
       x = "Lag", y = "ACF") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p1, p2, ncol = 2)

Diagnosis: Close price shows a clear trend and ACF values near 1 with slow decay → non-stationary.

Solution: Use log returns: \(r_t = \log(C_t) - \log(C_{t-1})\)

2.3.3 Log Returns: The Stationary Transformation

ggplot(data, aes(x = Date, y = returns)) +
  geom_line(color = "steelblue", linewidth = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed", alpha = 0.7) +
  labs(title = "Log Returns: r(t) = log(Cₜ) - log(Cₜ₋₁)",
       subtitle = "Mean-reverting around zero with volatility clustering",
       x = "Date", 
       y = "Log Returns") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Remove NA values
returns_clean <- na.omit(data$returns)

# ADF test 
# H0: Unit root (non-stationary)
adf.test(returns_clean)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  returns_clean
## Dickey-Fuller = -14.595, Lag order = 16, p-value = 0.01
## alternative hypothesis: stationary
# KPSS test 
# H0: Stationary
kpss.test(returns_clean, null = "Trend")
## 
##  KPSS Test for Trend Stationarity
## 
## data:  returns_clean
## KPSS Trend = 0.077407, Truncation lag parameter = 10, p-value = 0.1

Conclusion: Log returns are stationary—ADF rejects the unit root hypothesis, and KPSS fails to reject stationarity. Visually, they are mean-reverting around ~0 with changing variance over time, i.e., much closer to stationary (in mean) and exhibiting volatility clustering. We can proceed with modeling.


3. Model Creation

3.1 Data Preparation for Modeling

We create the dependent variable (log returns) and relevant predictors:

  • Lagged return (\(r_{t-1}\)): captures momentum/mean-reversion
  • Volume change (\(\Delta\log(Vol_t)\)): captures trading activity shifts
  • Intraday range (\(\log(H_t/L_t)\)): volatility proxy
prepare_regression_data <- function(data) {
  # Dependent variable: log returns
  data$log_returns <- c(NA, diff(log(data$Close)))
  
  # Regressors
  data$lag_return <- lag(data$log_returns, 1)
  data$intraday_range <- log(data$High / data$Low)
  data$log_Volume <- log(data$Volume)
  data$delta_log_Volume <- c(NA, diff(data$log_Volume))
  
  # Remove incomplete observations
  data <- data %>% filter(!is.na(log_returns), 
                          !is.na(lag_return),
                          !is.na(delta_log_Volume),
                          !is.na(intraday_range))
  return(data)
}

data_reg <- prepare_regression_data(data)
cat("Observations for modeling:", nrow(data_reg), "\n")
## Observations for modeling: 4127

3.2 Model 1: OLS Regression

3.2.1 Model Specification

\[r_t = \beta_0 + \beta_1 r_{t-1} + \beta_2 \Delta\log(Vol_t) + \beta_3 \log(H_t/L_t) + \varepsilon_t\]

model1 <- lm(log_returns ~ lag_return + delta_log_Volume + intraday_range, 
             data = data_reg)
summary(model1)
## 
## Call:
## lm(formula = log_returns ~ lag_return + delta_log_Volume + intraday_range, 
##     data = data_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41120 -0.01483 -0.00204  0.01393  0.25323 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.006891   0.000890   7.743 1.22e-14 ***
## lag_return       -0.035920   0.015536  -2.312   0.0208 *  
## delta_log_Volume  0.007332   0.001733   4.231 2.37e-05 ***
## intraday_range   -0.128431   0.016223  -7.916 3.12e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03511 on 4123 degrees of freedom
## Multiple R-squared:  0.01615,    Adjusted R-squared:  0.01543 
## F-statistic: 22.56 on 3 and 4123 DF,  p-value: 1.752e-14

You found small but real mean effects: slight mean reversion, volume-associated drift, and a strong negative relationship with intraday volatility.

But the model is not a strong return predictor (R² ~ 1.6%).

Residual behavior likely indicates heteroskedasticity and heavy tails, so pairing this with ARMA errors and/or GARCH volatility is the natural next step.

3.2.2 Residual Diagnostics

res1 <- residuals(model1)

# ACF of residuals
acf_res <- acf(res1, lag.max = 40, plot = FALSE, na.action = na.pass)
acf_df <- data.frame(Lag = acf_res$lag[,1,1], ACF = acf_res$acf[,1,1])

p_acf <- ggplot(acf_df, aes(x = Lag, y = ACF)) +
  geom_bar(stat = "identity", fill = "steelblue", width = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(yintercept = c(-1.96, 1.96) / sqrt(length(res1)), 
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "ACF of OLS Residuals",
       x = "Lag", y = "ACF") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# PACF of residuals
pacf_res <- pacf(res1, plot = FALSE, na.action = na.pass)
pacf_df <- data.frame(Lag = as.numeric(pacf_res$lag), PACF = as.numeric(pacf_res$acf))

p_pacf <- ggplot(pacf_df, aes(x = Lag, y = PACF)) +
  geom_bar(stat = "identity", fill = "darkgreen", width = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(yintercept = c(-1.96, 1.96) / sqrt(length(res1)), 
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "PACF of OLS Residuals",
       x = "Lag", y = "PACF") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p_acf, p_pacf, ncol = 2)

# Ljung-Box test
Box.test(res1, lag = 10, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  res1
## X-squared = 23.231, df = 10, p-value = 0.009927

Interpretation:

Residual ACF is mostly near zero, but a small number of lags exceed the 95% bounds, indicating mild residual autocorrelation.

Given the large sample size, even small correlations can be statistically significant; Ljung–Box confirms residuals are not perfectly white noise.

For forecasting: upgrade to regression with ARMA errors / ARIMAX, because the residuals aren’t fully independent.

3.3 Model 2: ARIMAX (Regression with ARMA Errors)

We model the regression error as an ARMA process. Since cryptocurrencies trade 7 days/week, we also test a seasonal component with period \(s=7\).

y <- data_reg$log_returns
xreg <- cbind(data_reg$lag_return, data_reg$delta_log_Volume, data_reg$intraday_range)

# Non-seasonal ARIMAX
fit_ns <- auto.arima(y, xreg = xreg, seasonal = FALSE)

# Seasonal ARIMAX (weekly seasonality)
y7 <- ts(y, frequency = 7)
fit_s <- auto.arima(y7, xreg = xreg, seasonal = TRUE)

# Model comparison
comparison <- data.frame(
  Model = c("Non-Seasonal ARIMAX", "Seasonal ARIMAX (s=7)"),
  AIC = c(AIC(fit_ns), AIC(fit_s)),
  BIC = c(BIC(fit_ns), BIC(fit_s))
)
kable(comparison, digits = 2, caption = "Model Selection Criteria")
Model Selection Criteria
Model AIC BIC
Non-Seasonal ARIMAX -15956.29 -15899.36
Seasonal ARIMAX (s=7) -15964.46 -15901.21
checkresiduals(fit_s)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(2,0,2)(0,0,1)[7] errors
## Q* = 11.901, df = 9, p-value = 0.2189
## 
## Model df: 5.   Total lags used: 14

Selected Model: Seasonal ARIMAX with ARIMA\((2,0,2)(0,0,1)_{7}\) errors achieves lower BIC and passes the Ljung-Box test.

3.4 Volatility Modeling (eGARCH)

3.4.1 Testing for ARCH Effects

Financial time series often exhibit volatility clustering—large returns tend to be followed by large returns. We test for ARCH effects in the ARIMAX residuals.

res2 <- residuals(fit_s)
res2 <- res2[!is.na(res2)]

# Visual check: ACF of squared residuals (volatility clustering)
acf_sq <- acf(res2^2, lag.max = 40, plot = FALSE, na.action = na.pass)
acf_sq_df <- data.frame(Lag = as.numeric(acf_sq$lag[,1,1]),
                        ACF = as.numeric(acf_sq$acf[,1,1]))

ggplot(acf_sq_df, aes(x = Lag, y = ACF)) +
  geom_bar(stat = "identity", fill = "steelblue", width = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(yintercept = c(-1.96, 1.96) / sqrt(length(res2)), 
             linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "ACF of Squared Residuals",
       subtitle = "Significant spikes indicate volatility clustering / ARCH effects",
       x = "Lag", y = "ACF") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Formal ARCH-LM tests
# H0: No ARCH effects (homoskedasticity) # 

ArchTest(res2, lags = 7)
## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  res2
## Chi-squared = 197.26, df = 7, p-value < 2.2e-16
ArchTest(res2, lags = 14)
## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  res2
## Chi-squared = 209.63, df = 14, p-value < 2.2e-16
ArchTest(res2, lags = 21)
## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  res2
## Chi-squared = 220.49, df = 21, p-value < 2.2e-16

Conclusion: Strong ARCH effects detected (p < 0.05). We proceed with GARCH modeling.

3.4.2 Fitting the eGARCH Model

We use an eGARCH(1,1) model with skewed Student-t innovations to capture:

  • Asymmetric volatility response (leverage effect)
  • Heavy tails in the return distribution
# eGARCH specification
spec <- ugarchspec(
  variance.model = list(model = "eGARCH", garchOrder = c(1,1)),
  mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
  distribution.model = "sstd"
)

# Fit on ARIMAX residuals
fit_vol <- ugarchfit(spec, data = res2, solver = "hybrid")
show(fit_vol)
## 
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
## 
## Conditional Variance Dynamics    
## -----------------------------------
## GARCH Model  : eGARCH(1,1)
## Mean Model   : ARFIMA(0,0,0)
## Distribution : sstd 
## 
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error   t value Pr(>|t|)
## omega  -0.098688    0.032346  -3.05099 0.002281
## alpha1 -0.008444    0.014671  -0.57557 0.564906
## beta1   0.984950    0.005059 194.67468 0.000000
## gamma1  0.226942    0.039111   5.80253 0.000000
## skew    1.139943    0.017382  65.58249 0.000000
## shape   2.879734    0.154490  18.64026 0.000000
## 
## Robust Standard Errors:
##         Estimate  Std. Error   t value Pr(>|t|)
## omega  -0.098688    0.059388  -1.66176 0.096560
## alpha1 -0.008444    0.023182  -0.36426 0.715665
## beta1   0.984950    0.009557 103.05761 0.000000
## gamma1  0.226942    0.078354   2.89635 0.003775
## skew    1.139943    0.015684  72.68141 0.000000
## shape   2.879734    0.172840  16.66129 0.000000
## 
## LogLikelihood : 8856.944 
## 
## Information Criteria
## ------------------------------------
##                     
## Akaike       -4.2893
## Bayes        -4.2801
## Shibata      -4.2893
## Hannan-Quinn -4.2860
## 
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                         statistic   p-value
## Lag[1]                      12.10 0.0005051
## Lag[2*(p+q)+(p+q)-1][2]     12.78 0.0003522
## Lag[4*(p+q)+(p+q)-1][5]     14.57 0.0005797
## d.o.f=0
## H0 : No serial correlation
## 
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      2.006  0.1566
## Lag[2*(p+q)+(p+q)-1][5]     2.820  0.4408
## Lag[4*(p+q)+(p+q)-1][9]     3.278  0.7122
## d.o.f=2
## 
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]     1.025 0.500 2.000  0.3113
## ARCH Lag[5]     1.190 1.440 1.667  0.6776
## ARCH Lag[7]     1.311 2.315 1.543  0.8584
## 
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  3.306
## Individual Statistics:             
## omega  0.1661
## alpha1 1.0930
## beta1  0.1642
## gamma1 1.4742
## skew   0.4420
## shape  0.1234
## 
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:          1.49 1.68 2.12
## Individual Statistic:     0.35 0.47 0.75
## 
## Sign Bias Test
## ------------------------------------
##                    t-value    prob sig
## Sign Bias            2.116 0.03442  **
## Negative Sign Bias   1.294 0.19565    
## Positive Sign Bias   1.291 0.19682    
## Joint Effect         4.795 0.18740    
## 
## 
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     23.06      0.23468
## 2    30     38.04      0.12140
## 3    40     49.12      0.12853
## 4    50     63.20      0.08361
## 
## 
## Elapsed time : 0.345021

Interpretation:

The eGARCH(1,1) model with skewed Student-\(t\) innovations fits volatility well: volatility is highly persistent \((\beta_1 = 0.985,\ p < 0.001)\) and there is sign asymmetry/leverage \((\gamma_1 = 0.227,\ \text{robust } p = 0.0038)\). The shock-magnitude term is not significant \((\alpha_1,\ p > 0.7)\). The skewed-\(t\) parameters are strongly significant (skew \(= 1.14\), shape \(= 2.88\)), confirming skewness and fat tails.

There is no remaining ARCH/volatility clustering (Ljung–Box on squared residuals \(p = 0.16\text{–}0.71\); ARCH–LM \(p = 0.31\text{–}0.86\)), but standardized residuals are still autocorrelated (Ljung–Box \(p < 0.001\)), suggesting leftover mean dynamics not captured in this specific fit. Parameter stability is a concern: the Nyblom joint test rejects stability \((3.306 > 2.12)\), especially for \(\alpha_1\) and \(\gamma_1\). Sign-bias is mildly significant \((p = 0.034)\), while overall distributional fit is acceptable (Pearson GoF \(p \approx 0.08\text{–}0.23\)).


4. Forecasting

4.1 Generating 5-Day Ahead Forecasts

We combine the mean model (seasonal ARIMAX) and volatility model (eGARCH) to generate forecasts with proper uncertainty quantification.

# Forecast horizon
h <- 5

# Future regressors (use last known values as proxy)
last_values <- tail(data_reg, 1)
xreg_future <- matrix(
  rep(c(last_values$lag_return, last_values$delta_log_Volume, last_values$intraday_range), h),
  nrow = h, ncol = 3, byrow = TRUE
)

# Forecast mean returns
fc_mean <- forecast(fit_s, h = h, xreg = xreg_future)
mu_hat <- as.numeric(fc_mean$mean)

# Forecast volatility
res_all <- as.numeric(residuals(fit_s))
res_all <- res_all[is.finite(res_all)]

spec <- ugarchspec(
  variance.model = list(model = "eGARCH", garchOrder = c(1,1)),
  mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
  distribution.model = "sstd"
)
fit_vol <- ugarchfit(spec, data = res_all, solver = "hybrid")
fc_vol <- ugarchforecast(fit_vol, n.ahead = h)
sigma_hat <- as.numeric(sigma(fc_vol))

4.2 Forecast Results

# Create forecast table
last_date <- max(data_reg$Date)
forecast_dates <- last_date + 1:h

forecast_results <- data.frame(
  Date = forecast_dates,
  Forecasted_Return = mu_hat,
  Forecasted_Volatility = sigma_hat,
  Lower_95 = mu_hat - 1.96 * sigma_hat,
  Upper_95 = mu_hat + 1.96 * sigma_hat
)

kable(forecast_results, digits = 6, caption = "Return Forecasts with 95% Confidence Intervals")
Return Forecasts with 95% Confidence Intervals
Date Forecasted_Return Forecasted_Volatility Lower_95 Upper_95
2026-01-06 0.014201 0.023308 -0.031482 0.059885
2026-01-07 0.002576 0.023477 -0.043440 0.048591
2026-01-08 0.002770 0.023645 -0.043574 0.049114
2026-01-09 0.002733 0.023811 -0.043938 0.049403
2026-01-10 0.002592 0.023976 -0.044401 0.049586

4.3 Forecast Visualization

# Plot 1: Return forecasts with confidence intervals
p1 <- ggplot(forecast_results, aes(x = Date)) +
  geom_ribbon(aes(ymin = Lower_95, ymax = Upper_95), alpha = 0.3, fill = "steelblue") +
  geom_line(aes(y = Forecasted_Return), color = "steelblue", linewidth = 1.2) +
  geom_point(aes(y = Forecasted_Return), color = "steelblue", size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red", alpha = 0.7) +
  labs(title = "5-Day Log Return Forecast",
       subtitle = "Shaded region: 95% confidence interval",
       x = "Date",
       y = "Forecasted Log Return") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Plot 2: Volatility forecast
p2 <- ggplot(forecast_results, aes(x = Date, y = Forecasted_Volatility)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +
  geom_point(color = "darkgreen", size = 3) +
  geom_area(alpha = 0.3, fill = "darkgreen") +
  labs(title = "5-Day Volatility Forecast",
       subtitle = "Forecasted standard deviation of returns",
       x = "Date",
       y = "Forecasted Volatility (σ)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

grid.arrange(p1, p2, ncol = 1)

4.4 Historical Context

# Show recent returns with forecast
recent_data <- tail(data_reg, 30)
recent_returns <- data.frame(
  Date = recent_data$Date,
  Return = recent_data$log_returns,
  Type = "Historical"
)

forecast_data <- data.frame(
  Date = forecast_results$Date,
  Return = forecast_results$Forecasted_Return,
  Type = "Forecast"
)

combined <- rbind(recent_returns, forecast_data)

ggplot(combined, aes(x = Date, y = Return, color = Type)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  geom_ribbon(data = forecast_results, 
              aes(x = Date, ymin = Lower_95, ymax = Upper_95), 
              inherit.aes = FALSE, alpha = 0.2, fill = "coral") +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c("Historical" = "steelblue", "Forecast" = "coral")) +
  labs(title = "Recent History and 5-Day Forecast",
       subtitle = "With 95% confidence interval for forecasts",
       x = "Date",
       y = "Log Return") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "bottom")


5. Summary

Key Findings

Aspect Result
Data 4129 observations covering 2014-09-17 to 2026-01-05
Stationarity Price levels non-stationary; log returns are stationary (ADF/KPSS confirmed)
Mean Model Seasonal ARIMAX(2,0,2)(1,0,0)₇ with external regressors
Volatility Model eGARCH(1,1) with skewed Student-t distribution
ARCH Effects Strong evidence of volatility clustering (p < 0.001)

Methodology Summary

  1. Preliminary Analysis: Loaded and cleaned data, confirmed Bitcoin trades 7 days/week, identified non-stationarity in price levels

  2. Model Creation:

    • Started with OLS regression on log returns
    • Upgraded to seasonal ARIMAX to capture residual autocorrelation
    • Added eGARCH to model time-varying volatility
  3. Forecasting: Generated 5-day ahead predictions combining mean and volatility forecasts with 95% confidence intervals

Limitations and Future Work

  • Parameter stability (regimes): Stability tests indicated time-varying parameters; consider rolling/expanding re-estimation or regime-switching volatility models.
  • Regressor forecasts: Future regressors were held constant (last observation carried forward); forecasting regressors or using scenario-based inputs could improve multi-step forecasts.
  • Model extensions: Compare asymmetric variants (e.g., GJR-GARCH, APARCH) and alternative innovation distributions.
  • External factors: Add macro/sentiment/on-chain variables to capture exogenous volatility drivers.
  • Out-of-sample evaluation: Use rolling-window backtests and risk-focused metrics (VaR/ES backtests, QLIKE) to assess forecast robustness.