Time Series02 Assignment

0.1 Data Description

This data set, retrieved from datahub.io, is a time series of major Natural Gas Prices including US Henry Hub. Data comes from U.S. Energy Information Administration EIA Data Dataset contains Monthly and Daily prices of Natural gas, starting from January 1997 to current year. Prices are in nominal dollars.

library(readr)
gas.price <- read_csv("monthly_csv.csv")

## Rows: 284 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): Price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

n.row = dim(gas.price)[1]
data.gas.price = gas.price[(n.row-150):n.row, ]

0.2 Define Time Series Object

Since this is monthly data, frequency =12 will be used the define the time series object.

gasprice.ts = ts(data.gas.price[,2], frequency = 12, start = c(2008, 1))
par(mar=c(2,2,2,2))
plot(gasprice.ts, main="Natural Gas Prices Between Feb, 2008 and Aug, 2020", ylab="Monthly Rate", xlab="")

0.3 Forecasting with Decomposing

Notice that the classical decomposition method does not work as well as the STL method due to the robustness of the LOESS component. The following visual representations show the different behaviors of the two methods of decomposition.

cls.decomp = decompose(gasprice.ts)
par(mar=c(2,2,2,2))
plot(cls.decomp, xlab="")

#stl.decomp=stl(gasprice.ts, s.window = "12")
#par(mar=c(2,2,2,2))
#plot(stl.decomp)

For whatever reason I could not get this code to run as I would receive an error message saying that only univariate series are allowed. Looking at the data and comparing it to that used in the case study, I am not sure what differentiates them.

0.4 Training and Testing Data

We hold up the last 7 periods of data for testing. The rest of the historical data will be used to train the forecast model.

To evaluate the effect of different sizes in training the time series, We define different training data sets with different sizes. Three training set sizes used in this example are 144, 109, 73, and 48. The same test set with size 7 will be used to calculate the prediction error.

ini.data = data.gas.price[,2]
n0 = length(ini.data)
##
train.data01 = data.gas.price[1:144, 2]
train.data02 = data.gas.price[37:144, 2]
train.data03 = data.gas.price[73:144, 2]
train.data04 = data.gas.price[97:144, 2]
## last 7 observations
test.data = data.gas.price[145:151,2]
##
train01.ts = ts(train.data01, frequency = 12, start = c(2008, 1))
train02.ts = ts(train.data02, frequency = 12, start = c(2011, 1))
train03.ts = ts(train.data03, frequency = 12, start = c(2014, 1))
train04.ts = ts(train.data04, frequency = 12, start = c(2016, 1))
##
#stl01 = stl(train01.ts, s.window = 12)
#stl02 = stl(train02.ts, s.window = 12)
#stl03 = stl(train03.ts, s.window = 12)
#stl04 = stl(train04.ts, s.window = 12)
## Forecast with decomposing
#fcst01 = forecast(stl01,h=7, method="naive")
#fcst02 = forecast(stl02,h=7, method="naive")
#fcst03 = forecast(stl03,h=7, method="naive")
#fcst04 = forecast(stl04,h=7, method="naive")

Looking at my data set, I cannot discern the reason why n0 was returning a length of 1, despite looking nearly identical to the case study data. After attempts at troubleshooting the error with the stl function, I could not figure out a solution.

0.5 Error Analysis

## To compare different errors, we will not use the percentage for MAPE
#PE01=(test.data-fcst01$mean)/fcst01$mean
#PE02=(test.data-fcst02$mean)/fcst02$mean
#PE03=(test.data-fcst03$mean)/fcst03$mean
#PE04=(test.data-fcst04$mean)/fcst04$mean
###
#MAPE1 = mean(abs(PE01))
#MAPE2 = mean(abs(PE02))
#MAPE3 = mean(abs(PE03))
#MAPE4 = mean(abs(PE04))
###
#E1=test.data-fcst01$mean
#E2=test.data-fcst02$mean
#E3=test.data-fcst03$mean
#E4=test.data-fcst04$mean
##
#MSE1=mean(E1^2)
#MSE2=mean(E2^2)
#MSE3=mean(E3^2)
#MSE4=mean(E4^2)
###
#MSE=c(MSE1, MSE2, MSE3, MSE4)
#MAPE=c(MAPE1, MAPE2, MAPE3, MAPE4)
#accuracy=cbind(MSE=MSE, MAPE=MAPE)
#row.names(accuracy)=c("n.144", "n.109", "n. 73", "n. 48")
#kable(accuracy, caption="Error comparison between forecast results with different sample sizes")

#plot(1:4, MSE, type="b", col="darkred", ylab="Error", xlab="",
#     ylim=c(0.4,.85),xlim = c(0.5,4.5), main="Error Curves", axes=FALSE)
#labs=c("n=144", "n=109", "n=73", "n=48")
#axis(1, at=1:4, label=labs, pos=0.4)
#axis(2)
#lines(1:4, MAPE, type="b", col="blue")
#text(1:4, MAPE+0.03, as.character(round(MAPE,4)), col="blue", cex=0.7)
#text(1:4, MSE-0.03, as.character(round(MSE,4)), col="darkred", cex=0.7)
#legend(1.5, 0.63, c("MSE", "MAPE"), col=c("darkred","blue"), lty=1, bty="n", cex=0.7)

Though I cannot run this code, I understand that the training size with the lowest MAPE and MSE error numbers yields the best performance.