Carbon Allowance Clearing Price Forecasting
Environmental concerns surrounding global warming and fossil fuels mount with each new climate-related disaster. As governments and firms look to a future of environmental sustainability, new policies governing CO2 emissions will be needed. A popular model proposed by economists is a carbon credit market. Market-based pricing is valuable because it directly measures and reflects the cost of carbon emissions while preserving economic efficiency.
This idea is not new, but it has seen only limited real-world application to date. In the US, the first such market with a cap on CO2 is operated by the Regional Greenhouse Gas Initiative (RGGI), which includes 11 states in the Northeast US. Through RGGI, CO2 emissions from power plants are capped, and firms must purchase CO2 allowances via quarterly auctions. Most firms mandated to participate are utility companies that generate power from traditional fossil fuels such as coal or natural gas.
The volatility of energy markets and the constant demand for energy make the transition away from fossil fuels difficult to implement. Concerns about US energy independence and the effectiveness of renewables are making the transition a slow one. The carbon allowance auction system puts a monetary value, a price, on emissions and lets the market for those emissions play out. Energy and emissions are among the most debated economic, political, and social issues of our time. An evaluation of the trend and future price of carbon allowances could suggest how allowances might be awarded in other auction-oriented markets. It can also give a better understanding of potential predictors or causal influences on the future supply, demand, and price of allowance credits.
1 Data and data cleaning
Data comes directly from the RGGI website, which provides information on clearing prices as well as reviews of individual auctions.
Most carbon allowance price forecasting literature includes a variety of predictors, including general economic indicators, fossil fuel prices, temperature, population, and other financial indicators, such as blue-chip stock prices.
Carbon allowances can be a financial investment, but most RGGI allowances are purchased by mandated companies, mostly utilities. For this reason, the prices and supply-and-demand conditions of fossil fuels are likely to influence demand for allowances. Population in the region may also influence price, since a larger population increases energy demand and thus the need for mandated companies to purchase carbon allowances through the RGGI auction.
From FRED we can use quarterly measures of
GDP
M1 Money Supply
Population
Natural Gas Spot Price
Global Price of Coal
Global Price of Brent Crude
Nasdaq 100
Federal Funds Rate (FFR)
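A minimal sketch of how these inputs might be assembled into the working data frame used below (file names are hypothetical; the report builds df from locally downloaded RGGI and FRED extracts):
# auction <- read_excel("rggi_auction_results.xlsx")   # hypothetical file: clearing price, quantities, proceeds
# macro   <- read_csv("fred_quarterly_series.csv")     # hypothetical file: GDP, M1, population, fuel prices, NASDAQ100, FFR
# df      <- left_join(auction, macro, by = "Quarter") # one row per quarterly auction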
# helpful packages
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 508289 27.2 1117592 59.7 NA 669322 35.8
## Vcells 936535 7.2 8388608 64.0 16384 1839947 14.1
cat("\f")
packages <- c("readr", # open csv
"readxl", # open excel file
"psych", # quick summary stats for data exploration,
"stargazer", #summary stats for sharing,
"tidyverse", # data manipulation like selecting variables,
"dplyr", # data manipulation
"scales", # scaling axes for plots
"lubridate",
"stringr", # manipulate string vars
"corrplot", # correlation plots
"ggplot2", # graphing
"ggcorrplot", # correlation plot
"gridExtra", #overlay plots
"data.table", # reshape for graphing
"car", #vif
"prettydoc", # html output
"visdat", # visualize missing variables
"glmnet", # lasso/ridge
"caret", # confusion matrix
"MASS", #step AIC
"plm", # fixed effects demeaned regression
"lmtest", # test regression coefficients
"prophet", # facebook prophet for ts forecasts
"fpp3", # Foprecasting: Principles & Practice supplement
"fDMA", # DMA forecasting
"eDMA", # DMA forecasting
"tsibble",
"feasts",
"tsibbledata",
"lubridate",
"forecast",
"kernlab" # LS-SVM forecasting
)
for (i in 1:length(packages)) {
if (!packages[i] %in% rownames(installed.packages())) {
install.packages(packages[i]
, repos = "http://cran.rstudio.com/"
, dependencies = TRUE
)
}
library(packages[i], character.only = TRUE)
}
rm(packages)
# data is clean!
df %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
glimpse()
## Rows: 1
## Columns: 15
## $ Quarter <int> 0
## $ Auction <int> 0
## $ Date <int> 0
## $ `Quantity Offered` <int> 0
## $ `Quantity Sold` <int> 0
## $ `Clearing Price` <int> 0
## $ `Total Proceeds` <int> 0
## $ GDP <int> 0
## $ M1 <int> 0
## $ Population <int> 0
## $ `Price NatGas` <int> 0
## $ `Price Crude` <int> 0
## $ `Price Coal` <int> 0
## $ NASDAQ100 <int> 0
## $ FFR <int> 0
2 Data analysis
stargazer(df,
type = "text",
digits = 2,
summary.stat = c("mean","sd","min","max"),
title = "Summary Statistics")##
## Summary Statistics
## ===============================
## Statistic Mean St. Dev. Min Max
## ===============================
2.1 Data distributions
With this plot, we can look at the distribution of values for the important variables in the dataset. 'Clearing Price' is our variable of interest (y).
Many of the variables appear to have noisy distributions. This could present an issue in forecasting, and it suggests that data transformations might be helpful. Most of the literature works with log-transformed data.
df_ggplot <- melt(df[, 4:15])
ggplot(data = df_ggplot,
mapping = aes(x = value)
) + geom_histogram() + facet_wrap(. ~ variable, scales = "free")
ggplot(df, aes(x = `Clearing Price`)) +
geom_histogram(fill = "blue", color = "black") +
labs(title = "Distribution of Clearing Prices", x = "Clearing Price", y = "Frequency")
2.2 Variable correlation
A quick correlation plot can give a hint as to which variables are most correlated before we build the forecast. Most predictors look to be pretty strongly correlated with Clearing Price.
df_corr <- subset(df, select = -c(Quarter, Auction, Date))
corrplot(cor(df_corr),
type = "lower",
method = "square"#"number" "circle"
)2.3 Time series plots
2.3.1 Auction data
Over the 60 quarters observed, the clearing price for RGGI carbon allowances has risen, with a sharp increase beginning around 2020.
ggplot(df, aes(x = Date, y = `Clearing Price`)) +
geom_line(color = "blue") +
labs(title = "Clearing Prices Over Time", x = "Date", y = "Clearing Price ($)")
The plot for total proceeds increases in a similar fashion, meaning price is moving up despite the consistent number of allowances offered, as shown below.
ggplot(df, aes(x = Date)) +
geom_line(aes(y = `Total Proceeds`, color = "Total Proceeds"), size = 1) +
geom_line(aes(y = `Quantity Offered`, color = "Quantity Offered"), size = 1) +
scale_color_manual(values = c("Total Proceeds" = "blue", "Clearing Price" = "red", "Quantity Offered" = "green")) +
labs(x = "Date", y = "Total Proceeds, USD") +
theme_minimal()
All allowances offered are typically sold.
ggplot(df, aes(x = Date)) +
geom_line(aes(y = `Quantity Offered`, color = "Quantity Offered"), size = 1) +
geom_line(aes(y = `Quantity Sold`, color = "Quantity Sold"), size = 1) +
scale_color_manual(values = c("Quantity Offered" = "blue", "Quantity Sold" = "red")) +
labs(title = "Quantity Offered and Quantity Sold Over Time", x = "Date", y = "Quantity") +
theme_minimal()
Together, the participating states have established a regional cap on CO2 emissions, which sets a limit on the emissions from regulated power plants within the RGGI states. Over time, the regional cap declines, so that CO2 emissions decrease in a planned and predictable way. Since its inception, RGGI emissions have fallen by more than 50%, twice as fast as the nation as a whole, and the program has so far raised nearly $6 billion to invest in local communities.
The supply curve is entirely determined by the quantity of carbon emission allowance credits offered by RGGI. The basic CO2 cap model used by RGGI implements a steady decline in allowances (and thus the overall cap) over time, as well as a 2.5% increase in the minimum price for each allowance each year. Because the number of states participating in RGGI has changed over the period of its existence, the supply of credits has fluctuated.
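As a quick illustration of the floor mechanism, a 2.5% annual escalation compounds multiplicatively; the starting reserve price below is an assumed value for illustration only:
# assumed $2.00 starting reserve price, escalated 2.5% per year for 15 years
floor_price <- 2.00 * 1.025^(0:14)
round(floor_price, 2)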
ggplot(df, aes(x = Date)) +
geom_line(aes(y = `Quantity Offered`, color = "Quantity Offered"), size = 1) +
scale_color_manual(values = c("Quantity Offered" = "blue")) +
labs(title = "Quantity Offered Over Time", x = "Date", y = "Quantity") +
scale_y_continuous(labels = comma, limits = c(0, NA)) +
theme_minimal()
2.3.2 Predictors
Because the supply curve is predetermined through RGGI's own mechanism and state participation structure, the important unknowns in determining clearing prices for CO2 allowances are the causal factors that shape the demand curve.
Buyers in the RGGI carbon market may express higher demand for credits when energy demand is higher or when the prices of alternative fuels rise. They may also face pressure to offset emissions as temperatures rise with a warming global climate driven by intensive consumption of CO2-emitting fossil fuels. Good predictors will be statistically significant in determining demand for credits, which is the quantity of credits sold at each auction.
ggplot(df, aes(x = Date)) +
geom_line(aes(y = `Price NatGas`, color = "Natural Gas Price"), size = 1) +
geom_line(aes(y = `Price Crude`, color = "Crude Oil Price"), size = 1) +
geom_line(aes(y = `Price Coal`, color = "Coal Price"), size = 1) +
scale_color_manual(values = c("Natural Gas Price" = "blue", "Crude Oil Price" = "red", "Coal Price" = "green")) +
labs(title = "Price Trends Over Time", x = "Date", y = "Price") +
theme_minimal()
It looks like the prices of crude and coal move together over time.
ggplot(df, aes(x = Date, y = `Price NatGas`)) +
geom_line(color = "blue") +
labs(title = "Natural Gas Prices Over Time", x = "Date", y = "Price ($)")
Clearing price may experience some seasonality in a similar way to natural gas prices.
ggplot(df, aes(x = Date)) +
geom_line(aes(y = `Price NatGas`, color = "Natural Gas Price"), size = 1) +
geom_line(aes(y = `Clearing Price`, color = "Clearing Price"), size = 1) +
scale_color_manual(values = c("Natural Gas Price" = "blue", "Clearing Price" = "green")) +
labs(title = "Natural Gas and Clearing Price", x = "Date", y = "Price") +
theme_minimal()
3 Data transformations and adjustments
Many of the variables are not normally distributed over the time period, as seen in the histogram plots above. Most of the literature simply log-normalizes the variables, including clearing price.
# removing RGGI institutional factors from the dataset so models run more easily;
# the time-series structure does not matter for this variable-selection step
df_ts <- df %>%
mutate(Date = yearquarter(Date)) %>%
as_tsibble(index = Date)
df_ts$`GDP per Capita` <- df_ts$GDP / df_ts$Population
df_ts <- df_ts[, !colnames(df_ts) %in% c("GDP", "Population")]
columns_to_exclude <- c("Quarter", "Auction", "Date", "FFR")
# Log-transform all columns except those in columns_to_exclude
df_ts_ln <- df_ts %>%
mutate(across(-all_of(columns_to_exclude), log))
# Create a new data frame with the institutional variables removed
df_new <- df_ts_ln[, !names(df_ts_ln) %in% c("Quarter", "Auction", "Date", "Quantity Offered", "Total Proceeds", "Quantity Sold")]
4 Variable selection
To identify the predictors that are best for determining demand for emission allowances, a few variable selection techniques will be applied. The goal in this process is to parse out which variables are strongest in predicting the quantity of allowances sold, so that a more effective causal interpretation of the corresponding coefficients can be made. The strongest predictors will then be used to forecast future demand and price.
Variable selection will not include factors related to the RGGI auction itself, such as Quantity Offered, Total Proceeds, or (obviously) Clearing Price. These RGGI factors will be referred to as 'institutional factors'. The goal is to determine which outside factors are good predictors of demand, so economic factors will be used in the following selection methods.
4.1 Simple OLS
A simple OLS may elucidate which predictors are strongest while ignoring any time element. With Clearing Price as the dependent variable, the OLS displays statistical significance from fossil fuel prices as well as some general macro factors such as GDP and M1, and less so for FFR.
The coefficients for the group of fossil fuel prices are also notable. Only natural gas has a positive coefficient, meaning higher natural gas prices correlate with higher carbon allowance clearing prices, while the prices of Brent crude and coal move inversely to the price of emissions.
lm_model <- lm(`Clearing Price` ~ ., data = df_new)
stargazer(lm_model, type = 'text', digits = 3)
##
## ===============================================
## Dependent variable:
## ---------------------------
## `Clearing Price`
## -----------------------------------------------
## M1 0.279***
## (0.095)
##
## `Price NatGas` 0.546***
## (0.124)
##
## `Price Crude` -0.181
## (0.114)
##
## `Price Coal` -0.552***
## (0.136)
##
## NASDAQ100 -0.915***
## (0.234)
##
## FFR -0.066
## (0.046)
##
## `GDP per Capita` 7.504***
## (1.265)
##
## Constant 30.892***
## (5.678)
##
## -----------------------------------------------
## Observations 60
## R2 0.903
## Adjusted R2 0.890
## Residual Std. Error 0.202 (df = 52)
## F Statistic 69.296*** (df = 7; 52)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
lm2 <- lm(`Clearing Price` ~ `Price NatGas` + `Price Crude` + `Price Coal` + `GDP per Capita`, data = df_new)
stargazer(lm2, type = 'text', digits = 3)
##
## ===============================================
## Dependent variable:
## ---------------------------
## `Clearing Price`
## -----------------------------------------------
## `Price NatGas` 0.634***
## (0.135)
##
## `Price Crude` -0.312**
## (0.125)
##
## `Price Coal` -0.308**
## (0.130)
##
## `GDP per Capita` 4.090***
## (0.317)
##
## Constant 15.051***
## (1.123)
##
## -----------------------------------------------
## Observations 60
## R2 0.865
## Adjusted R2 0.855
## Residual Std. Error 0.232 (df = 55)
## F Statistic 87.852*** (df = 4; 55)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
# MSE (parenthesized so this is the mean of squared residuals; the original
# mean(residuals)^2 collapses to essentially zero because OLS residuals average to zero)
lm_predictions <- predict(lm2, newdata = df_new)
mse_ols <- mean((df_new$`Clearing Price` - lm_predictions)^2)
cat("MSE for OLS: ", mse_ols, "\n")
# Display OLS coefficients
ols_coefficients <- coef(lm2)
print(ols_coefficients)
## (Intercept) `Price NatGas` `Price Crude` `Price Coal`
## 15.0509699 0.6335335 -0.3116699 -0.3079229
## `GDP per Capita`
## 4.0902733
4.2 Step AIC
The backward selection function is another way of displaying the significance of the potential predictors in a model. It also uses linear modelling, so the results are essentially the same as the OLS. In particular, GDP and the fossil fuel prices are the best predictors of the clearing price.
#Step AIC for variable selection
bw_select <- stepAIC(object = lm(data = df_new, df_new$`Clearing Price` ~ .),
direction = c("backward"))## Start: AIC=-184.44
## df_new$`Clearing Price` ~ M1 + `Price NatGas` + `Price Crude` +
## `Price Coal` + NASDAQ100 + FFR + `GDP per Capita`
##
## Df Sum of Sq RSS AIC
## <none> 2.1250 -184.44
## - FFR 1 0.08314 2.2081 -184.13
## - `Price Crude` 1 0.10367 2.2286 -183.58
## - M1 1 0.35197 2.4769 -177.24
## - NASDAQ100 1 0.62766 2.7526 -170.91
## - `Price Coal` 1 0.66928 2.7942 -170.01
## - `Price NatGas` 1 0.79404 2.9190 -167.39
## - `GDP per Capita` 1 1.43744 3.5624 -155.44
4.3 Ridge
y <- df_new$`Clearing Price`
# Perform Ridge regression
X <- as.matrix(df_new[, -which(names(df_new) == "Clearing Price")])
alpha <- 0 # Set alpha to 0 for Ridge regression
ridge_model <- glmnet(X, y, alpha = alpha)
# Cross-validate Ridge to find the optimal lambda value
cvfit <- cv.glmnet(X, y, alpha = alpha)
# Extract the optimal lambda value from cross-validation
best_lambda <- cvfit$lambda.min
cat("Optimal Lambda for Ridge: ", best_lambda, "\n")
## Optimal Lambda for Ridge: 0.05383646
# Fit Ridge regression model with the optimal lambda
ridge_model_optimal <- glmnet(X, y, alpha = alpha, lambda = best_lambda)
# MSE for Ridge
ridge_predictions <- predict(ridge_model_optimal, s = best_lambda, newx = X)
mse_ridge <- mean((y - ridge_predictions)^2)
cat("MSE for Ridge: ", mse_ridge, "\n")## MSE for Ridge: 0.05238869
# Display Ridge coefficients
ridge_coefficients <- coef(ridge_model_optimal)
print(ridge_coefficients)
## 8 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 5.03233169
## M1 0.22366989
## Price NatGas 0.41452655
## Price Crude -0.31441934
## Price Coal -0.13688221
## NASDAQ100 0.11943988
## FFR 0.03825435
## GDP per Capita 1.76099980
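Since glmnet is already loaded with both penalties in mind, a lasso fit (alpha = 1) is a natural companion check: unlike ridge, it can shrink some coefficients exactly to zero and so performs explicit selection. A minimal sketch reusing the X and y defined above:
# Lasso counterpart to the ridge fit above (alpha = 1 selects the lasso penalty)
lasso_cv <- cv.glmnet(X, y, alpha = 1)
coef(lasso_cv, s = "lambda.min") # coefficients at the cross-validated lambda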
After analyzing the results from these variable selection tools, it appears price levels of coal, crude, and natural gas will be strong predictors of clearing price. In addition, macro variables such as GDP per Capita also appear to be strong predictors.
4.4 Assessing multicollinearity
We want our forecast to exploit patterns such as autocorrelation in the residuals or seasonality, not multicollinearity between predictors. Multicollinearity makes causal interpretation more difficult and may result in over-fitting and poor forecasts. The variance inflation factor (VIF) is commonly used to test for multicollinearity, with a common rule of thumb being VIF < 10. Some of the carbon price forecasting literature uses a VIF threshold of 5, and the same will be applied here.
Price of natural gas, coal, and crude, as well as GDP per capita are statistically significant in predicting price in an OLS setting and pass the VIF threshold of 5.
vif(lm2)
## `Price NatGas` `Price Crude` `Price Coal` `GDP per Capita`
## 2.465007 1.964074 4.312772 2.598867
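For comparison, the same check can be run on the full OLS specification fitted earlier; a one-line sketch:
vif(lm_model) # VIF for the full OLS specification, to see which terms exceed the threshold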
5 Univariate forecasts
5.1 Prophet model
The Prophet model developed by forecasters at Facebook uses trend, seasonality, and holiday effects to make forecasts. Parameters will have to be adjusted to tell the package we are using quarterly data. The formula looks like:
\[ y(t) = g(t) + s(t) + h(t) + \varepsilon_t \]
\(y(t)\) represents the observed time series data at time \(t\),
\(g(t)\) is the trend component,
\(s(t)\) is the seasonal component,
\(h(t)\) represents holiday effects, and \(\varepsilon_t\) is the error term.
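With only one observation per quarter there is no weekly or daily cycle to estimate, so those components can be switched off explicitly when fitting; a minimal sketch (the data frame name stands in for the ds/y frame built in the chunk below):
# prophet_q <- prophet(quarterly_df,               # data frame with ds (date) and y columns
#                      yearly.seasonality = TRUE,  # keep the within-year pattern
#                      weekly.seasonality = FALSE, # no weekly cycle in quarterly data
#                      daily.seasonality = FALSE)  # no daily cycle either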
# we have a data frame df with "Date" in "YYYY-MM-DD" format
df2 <- df %>%
mutate(ds = Date, # will have to convert to Date class if not already
y = `Clearing Price`) # Rename "Clearing Price" to "y"
# resampling data to quarterly intervals
auction_data_quarterly <- df2 %>%
mutate(quarter = quarter(ds), # Extract the quarter from the date
year = year(ds)) %>%
group_by(year, quarter) %>%
summarize(y = mean(y)) %>%
ungroup() %>%
mutate(ds = ymd(paste(year, quarter * 3, "01", sep = "-")))
prophet_model <- prophet(auction_data_quarterly)
future <- make_future_dataframe(prophet_model, periods = 4, freq = "quarter")
forecast <- predict(prophet_model, future)
prophet_plot <- plot(prophet_model, forecast)
print(prophet_plot)
The components plot shows the upward trend identified by the model. There also appear to be some seasonal components throughout the year in the quarterly allowance measure.
components_plot <- prophet_plot_components(prophet_model, forecast)
5.2 ARIMA
# Remove the "Quarter" and "Auction" columns
df_ts_ln <- df_ts_ln[, !colnames(df_ts_ln) %in% c("Quarter", "Auction")]
# Create a time series object with 'Clearing Price' as the response variable
ts_data <- ts(df_ts_ln$`Clearing Price`, frequency = 4)
y <- ts_data
arima_model <- auto.arima(y)
forecast_result1 <- forecast(arima_model, h = 4)
# Print the forecast result
print(summary(forecast_result1))
##
## Forecast method: ARIMA(0,1,0)(0,0,1)[4]
##
## Model Information:
## Series: y
## ARIMA(0,1,0)(0,0,1)[4]
##
## Coefficients:
## sma1
## 0.3520
## s.e. 0.1524
##
## sigma^2 = 0.0222: log likelihood = 28.84
## AIC=-53.69 AICc=-53.47 BIC=-49.53
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01665962 0.146503 0.09495261 0.561187 7.419489 0.3500124
## ACF1
## Training set 0.06685957
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 16 Q1 2.521452 2.330491 2.712413 2.229403 2.813502
## 16 Q2 2.471133 2.201073 2.741192 2.058113 2.884153
## 16 Q3 2.453810 2.123056 2.784564 1.947966 2.959654
## 16 Q4 2.458566 2.076645 2.840488 1.874468 3.042665
autoplot(forecast_result1)
checkresiduals(forecast_result1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0)(0,0,1)[4]
## Q* = 6.5707, df = 7, p-value = 0.4749
##
## Model df: 1. Total lags used: 8
ARIMA with train/test split:
# Assuming you have a time series object ts_data
# Set the seed for reproducibility
set.seed(123)
# Determine the size of the training set (e.g., 80% for training, 20% for testing)
train_size <- floor(0.8 * length(ts_data))
# Generate a random index to split the time series
split_index <- sample(1:length(ts_data), train_size)
# Create training and test sets
train_data <- ts_data[split_index]
test_data <- ts_data[-split_index]
# Fit an ARIMA model to the training set
arima_model <- auto.arima(train_data)
# Forecast using the ARIMA model on the test set
forecast_result1 <- forecast(arima_model, h = length(test_data))
# Print or use the forecasted values
print(forecast_result1)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 49 1.757069 1.1040558 2.410082 0.7583715 2.755767
## 50 1.275249 0.6134803 1.937018 0.2631611 2.287337
## 51 1.430224 0.7332240 2.127224 0.3642544 2.496194
## 52 1.517769 0.7715987 2.263940 0.3765999 2.658939
## 53 1.355698 0.5943434 2.117052 0.1913068 2.520089
## 54 1.470313 0.7088002 2.231826 0.3056796 2.634947
## 55 1.438857 0.6750028 2.202712 0.2706426 2.607072
## 56 1.412759 0.6460629 2.179456 0.2401982 2.585320
## 57 1.453773 0.6862927 2.221254 0.2800130 2.627534
## 58 1.426722 0.6592401 2.194203 0.2529599 2.600483
## 59 1.432778 0.6651288 2.200428 0.2587597 2.606797
## 60 1.440271 0.6724501 2.208092 0.2659903 2.614552
autoplot(forecast_result1)
rmse_arima <- sqrt(mean((forecast_result1$mean - test_data)^2))
cat("ARIMA RMSE:", rmse_arima, "\n")
## ARIMA RMSE: 0.7508137
6 Forecasts with predictors
These more advanced methodologies have been used to forecast carbon price equilibrium in academic research.
6.1 ARIMA
# Convert predictor variables to a numeric matrix, excluding "Date"
xreg_matrix <- as.matrix(df_ts_ln[, -which(names(df_ts_ln) %in% c("Clearing Price", "Date", "Quantity Offered", "Quantity Sold", "Total Proceeds", "FFR", "M1"))])
# Fit an ARIMA model with the numeric matrix of predictor variables
arima_model2 <- auto.arima(y, xreg = xreg_matrix)
fc.ng <- forecast(xreg_matrix[,1], h=4)
fc.crude <- forecast(xreg_matrix[,2], h =4)
fc.coal <- forecast(xreg_matrix[,3], h =4)
fc.nasdaq <- forecast(xreg_matrix[,4], h =4)
fc.gdp <- forecast(xreg_matrix[,5], h =4)
newxreg <- as.matrix(cbind(fc.ng$mean, fc.crude$mean, fc.coal$mean, fc.nasdaq$mean, fc.gdp$mean))
cprice <- forecast(arima_model2, xreg = newxreg)
## Warning in forecast.forecast_ARIMA(arima_model2, xreg = newxreg): xreg contains
## different column names from the xreg used in training. Please check that the
## regressors are in the same order.
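# (Sketch) The warning arises because newxreg carries the forecast-object column names
# (fc.ng$mean, etc.) rather than the names used in training; matching them beforehand avoids it:
# colnames(newxreg) <- colnames(xreg_matrix)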
# Print the forecast result
print(summary(cprice))
##
## Forecast method: Regression with ARIMA(1,0,0) errors
##
## Model Information:
## Series: y
## Regression with ARIMA(1,0,0) errors
##
## Coefficients:
## ar1 intercept Price NatGas Price Crude Price Coal NASDAQ100
## 0.8281 8.9823 0.1842 -0.3460 0.0196 0.1395
## s.e. 0.0731 4.1273 0.1144 0.1264 0.1340 0.1969
## GDP per Capita
## 2.6459
## s.e. 0.9032
##
## sigma^2 = 0.02189: log likelihood = 32.66
## AIC=-49.32 AICc=-46.5 BIC=-32.57
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.004563315 0.1390471 0.09987599 -2.075982 8.400741 0.3681609
## ACF1
## Training set 0.07238647
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 16 Q1 2.537955 2.348356 2.727554 2.247988 2.827921
## 16 Q2 2.537903 2.291738 2.784068 2.161426 2.914380
## 16 Q3 2.542678 2.264291 2.821065 2.116922 2.968434
## 16 Q4 2.551438 2.252960 2.849916 2.094956 3.007921
autoplot(cprice)
checkresiduals(cprice)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,0,0) errors
## Q* = 6.5675, df = 7, p-value = 0.4753
##
## Model df: 1. Total lags used: 8
To obtain a better model fit, values of \(y\) and the \(xreg\) matrix were log-normalized. For the final forecast, these values need to be exponentiated so the forecasts are back in their original (dollar) units.
# Exponentiate the forecast values for the response variable
final_forecast <- exp(cprice$mean)
# Exponentiate the forecast values for the predictor variables
newxreg_exp <- exp(newxreg)
# Ensure that the lengths match (cprice$mean is a ts, so use length() rather than nrow())
n_rows <- min(length(cprice$mean), nrow(newxreg_exp))
# Combine the response variable forecast with the predictor variable forecasts
# (note: cprice$x is the historical log series, so the "Date" column printed below shows
# the first few log clearing prices rather than calendar dates)
final_forecast_df <- data.frame(Date = cprice$x[1:n_rows], Final_Forecast = final_forecast[1:n_rows])
final_forecast_df <- cbind(final_forecast_df, newxreg_exp[1:n_rows, ])
# Print or use final_forecast_df as needed
print(final_forecast_df) # point forecast
## Date Final_Forecast fc.ng$mean fc.crude$mean fc.coal$mean fc.nasdaq$mean
## 1 1.121678 12.65376 2.16214 78.08654 119.08372 14309.83
## 2 1.217876 12.65311 2.16214 78.08654 87.82685 14819.78
## 3 1.255616 12.71368 2.16214 78.08654 68.48577 15347.90
## 4 1.172482 12.82554 2.16214 78.08654 55.89115 15894.84
## fc.gdp$mean
## 1 0.08149218
## 2 0.08219593
## 3 0.08290577
## 4 0.08362173
6.2 DMA
The Dynamic Model Averaging (DMA) methodology is popular in carbon price forecasting and commodity modelling generally because it can accommodate a large number of predictors and adjust regression coefficients as the time series evolves. Since we are interested in several predictors whose influence on the dependent variable (the carbon price) may change over time, DMA may produce a stronger forecast than the multivariate ARIMA.
6.2.1 fDMA package
The fDMA package uses static forgetting factors, usually with \(\alpha = \lambda\), where lower values of the forgetting factors give more weight to recent observations in the forecasts. This can be useful for this particular auction data given the recent rise in clearing prices compared with the beginning of the program in 2008.
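Roughly, each candidate model \(k\) in DMA is a time-varying-parameter regression,
\[ y_t = x_t^{(k)\prime}\,\theta_t^{(k)} + \varepsilon_t^{(k)}, \qquad \theta_t^{(k)} = \theta_{t-1}^{(k)} + \eta_t^{(k)}, \]
where \(\lambda\) inflates the state covariance between updates (\(\Sigma_{t|t-1} = \Sigma_{t-1|t-1}/\lambda\)) and \(\alpha\) discounts each model's past predictive performance when its probability weight is updated, so values below one let both the coefficients and the model weights adapt more quickly to recent data.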
#fDMA package
dma_model1 <- fDMA(y, xreg_matrix, .90, .90, .01)
print(dma_model1)
## Model: ~ const + Price NatGas + Price Crude + Price Coal + NASDAQ100 + GDP per Capita
##
## alpha lambda initvar model type W initial period V.meth kappa
## parameters 0.9 0.9 0.01 DMA reg 1 rec <NA>
## omega m.prior DOW threshold DOW.nmods DOW.type
## parameters <NA> 0.5 0 <NA> <NA>
##
## model naive auto ARIMA
## RMSE 0.2514 0.1545 NA
## MAE 0.1816 0.1015 NA
##
## observations: 60
## models: 32
## variables (incl. constant): 6
plot(dma_model1, 1)
# dma_forecast_values <- forecast(dma_model1$y.hat, h = 4)
# autoplot(dma_forecast_values)
# point forecasts, exponentiated:
# exp(dma_forecast_values$mean)
p1 <- predict(dma_model1, newdata = xreg_matrix, type = "backward") # consider type = "forward"
plot(exp(p1))
Applying a train/test split:
# Assuming you have a time series object ts_data and an xreg_matrix for the fDMA model
# Set the seed for reproducibility
set.seed(123)
# Determine the size of the training set (e.g., 80% for training, 20% for testing)
train_size <- floor(0.8 * length(ts_data))
# Generate a random index to split the time series
split_index <- sample(1:length(ts_data), train_size)
# Create training and test sets
train_data <- ts_data[split_index]
test_data <- ts_data[-split_index]
train_xreg <- xreg_matrix[split_index, ]
test_xreg <- xreg_matrix[-split_index, ]
# Fit fDMA model to the training set
dma_model1 <- fDMA(train_data, train_xreg, 0.90, 0.90, 0.01)
# Forecast using the fDMA model on the test set
dma_forecast <- predict(dma_model1, newdata = test_xreg, type = 'forward')
rmse_dma <- sqrt(mean((dma_forecast - test_data)^2))
cat("fDMA RMSE:", rmse_dma, "\n")
## fDMA RMSE: 0.3096359
6.2.2 eDMA package
The eDMA package uses a grid of values for lambda, which gives more flexibility to the forgetting factors in the forecasts.
#eDMA package
# uses grid of values for lambda
data_df <- data.frame(ts_data, xreg_matrix)
dma_model2 <- DMA(ts_data ~., data = data_df, vDelta = c(0.9, 0.95, 0.99), dAlpha = 0.99,
vKeep = NULL, bZellnerPrior = FALSE, dG = 100, bParallelize = TRUE,
iCores = NULL, dBeta = 1.0)
## Warning in funcEstimate_Eff_par(vY, mF, vDelta, dAlpha, vKeep, dBeta,
## bZellnerPrior, : OpenMP is not available. Sequential processing is used.
print(dma_model2)
##
## ------------------------------------------
## - Dynamic Model Averaging -
## ------------------------------------------
##
## Model Specification
## T = 60
## n = 6
## d = 3
## Alpha = 0.99
## Beta = 1
## Model combinations = 63
## Model combinations including averaging over delta = 189
## ------------------------------------------
## Prior : Multivariate Gaussian with mean vector 0 and covariance matrix equal to: 100 x diag(6)
## ------------------------------------------
## The grid for delta:
##
## Delta = 0.9, 0.95, 0.99
## ------------------------------------------
##
## Elapsed time : 0.05 secs
summary(dma_model2)
##
## Call:
## DMA(formula = ts_data ~ . )
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4497 -0.1070 0.0293 0.2033 1.1217
##
## Coefficients:
## E[theta_t] SD[theta_t] E[P(theta_t)] SD[P(theta_t)]
## (Intercept) 2.73 2.35 0.59 0.13
## Price.NatGas 0.07 0.06 0.18 0.09
## Price.Crude -0.04 0.07 0.14 0.09
## Price.Coal -0.32 0.38 0.52 0.34
## NASDAQ100 0.25 0.23 0.45 0.24
## GDP.per.Capita 0.77 1.05 0.41 0.15
##
## Variance contribution (in percentage points):
## vobs vcoeff vmod vtvp
## 0.10 99.88 0.01 0.00
##
## Top 10% included regressors: (Intercept)
##
## Forecast Performance:
## DMA DMS
## MSE 0.066 0.095
## MAD 0.212 0.256
## Log-predictive Likelihood -4.755 -10.093
# RMSE for eDMA
sqrt(0.066)
## [1] 0.2569047
6.3 SVM
Another popular method for forecasting data that may be nonlinear and non-stationary is the least squares support vector machine (LS-SVM). A kernel-based learning technique, the SVM may better capture non-linearity than the popular ARIMA model.
# Build a support vector regression model (kernlab's eps-svr with an RBF kernel)
svm2 <- ksvm(xreg_matrix, y, type = "eps-svr", kernel = "rbfdot")
# Make predictions
Y_pred <- predict(svm2, xreg_matrix)
# Evaluate the model (in-sample RMSE)
sqrt(mean((Y_pred - y)^2))
## [1] 0.1108792
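The RMSE above is computed in-sample, which flatters a flexible kernel model; kernlab's built-in k-fold option gives a rough out-of-sample check. A sketch with the default cost and epsilon settings:
svm_cv <- ksvm(xreg_matrix, y, type = "eps-svr", kernel = "rbfdot", cross = 5)
cross(svm_cv) # 5-fold cross-validated mean squared error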
6.4 ARIMA-SVM
This is simply a combination of ARIMA and SVM model predictions, with a heavier weight placed on the more accurate SVM. It produces a slightly lower RMSE than the SVM alone.
An important factor in the combination of forecasts is the weights assigned to each forecast. Based on predicted (fitted) values provided by each forecast, I want to create a function that finds the optimal combination of weights based on the lowest RMSE, which is the error evaluation I have been using for all forecasts.
tune_weights <- function(y, arima_model2, Y_pred, max_iterations = 100, tolerance = 1e-6) {
best_rmse <- Inf
best_weights <- c()
for (i in 1:max_iterations) {
# Generating random weights
weight_arima <- runif(1)
weight_y_pred <- 1 - weight_arima
# forecast combo!
combined_forecast <- (weight_arima * arima_model2$fitted) + (weight_y_pred * Y_pred)
# Evaluate the combined forecast (using RMSE)
rmse <- sqrt(mean((combined_forecast - y)^2))
# Update best weights if the current RMSE is lower
if (rmse < best_rmse) {
best_rmse <- rmse
best_weights <- c(weight_arima, weight_y_pred)
}
# Check for convergence based on tolerance
if (rmse < tolerance) {
break
}
}
# Normalize weights to ensure they sum to 1 for complete point forecasts
best_weights <- best_weights / sum(best_weights)
cat("Best Root Mean Squared Error (RMSE):", best_rmse, "\n")
cat("Best Weights: ARIMA =", best_weights[1], ", Y_pred =", best_weights[2], "\n")
return(best_weights)
}
# In action:
best_weights <- tune_weights(y, arima_model2, Y_pred)
## Best Root Mean Squared Error (RMSE): 0.1087847
## Best Weights: ARIMA = 0.1838495 , Y_pred = 0.8161505
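Because the two weights are constrained to sum to one, the combined RMSE is a smooth function of a single scalar, so the random search can also be replaced by a deterministic one-dimensional optimization; a sketch:
rmse_combo <- function(w) sqrt(mean((w * arima_model2$fitted + (1 - w) * Y_pred - y)^2))
opt <- optimize(rmse_combo, interval = c(0, 1))
opt$minimum   # weight on the ARIMA fitted values
opt$objective # corresponding RMSE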
# Combine ARIMA and SVM forecasts manually with given weights
#combined_forecast <- (0.2886095 * arima_model2$fitted) + (0.7113905 * Y_pred)
# Evaluate the combined forecast
#rmse <- sqrt(mean((combined_forecast - y)^2))
#cat("Combined Root Mean Squared Error (RMSE):", rmse, "\n")# Forecast 4 periods into the future - Hyndman package, defaulting to ETS?
#forecast_values <- forecast(combined_forecast, h = 4)
#autoplot(forecast_values)# point forecasts, exponentitated:
# exp(forecast_values$mean)But- there may be a better way. ARIMA models are ideal for estimating linearity, with errors of ARIMA models necessarily white noise. Support vector machines, on the other hand, are best for nonlinearity and high dimensionality, and can potentially pick up on patterns on the white noise errors left behind by an ARIMA model. This method of a combined ARIMA-SVM has been used before, and could be beneficial here given the low RMSE of both ARIMA and SVM models on this dataset.
# Calculate residuals
residuals_arima <- y - cprice$fitted
residuals_arima <- as.vector(residuals_arima)
svm_model <- ksvm(xreg_matrix, residuals_arima, type = "eps-svr", kernel = "rbfdot")
svm_pred <- predict(svm_model)
combo2 <- (cprice$fitted + svm_pred)
sqrt(mean((combo2 - y)^2))
## [1] 0.4422394
# this seems way too high? one possible culprit: predict() is called without newdata,
# so passing newdata = xreg_matrix explicitly would make the prediction target unambiguous