DATA 624: Homework 2
library(fpp2)  # loads forecast and ggplot2, plus the usnetelec, usgdp, mcopper, enplanements, and cangas series
lambda <- BoxCox.lambda(usnetelec)
dframe <- cbind(RawSeries = usnetelec, TransformedData = BoxCox(usnetelec,lambda))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("billion kwh") +
ggtitle("Annual US net electricity generation, \n power parameter lambda = ", round(lambda,5))
lambda <- BoxCox.lambda(usgdp)
dframe <- cbind(RawSeries = usgdp, TransformedData = BoxCox(usgdp,lambda))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("US Dollars") +
ggtitle("Quarterly US GDP, \n power parameter lambda = ", round(lambda,5))
lambda <- BoxCox.lambda(mcopper)
dframe <- cbind(RawSeries = mcopper, TransformedData = BoxCox(mcopper,lambda))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("pounds per ton") +
ggtitle("Monthly copper prices, \n power parameter lambda = ", round(lambda,5))
lambda <- BoxCox.lambda(enplanements)
dframe <- cbind(RawSeries = enplanements, TransformedData = BoxCox(enplanements,lambda))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("millions") +
ggtitle("US domestic enplanements, \n power parameter lambda = ", round(lambda,5))
In the monthly Canadian gas production series, the variation in the data does not increase or decrease with the level of the series, so a Box-Cox transformation is not useful here: the largest variation occurs at mid-levels of the series.
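For reference, the Box-Cox transformation applied by BoxCox() above is, for a series $y_t$ and power parameter $\lambda$,

$$
w_t = \begin{cases}
\log(y_t), & \lambda = 0, \\
\left(y_t^{\lambda} - 1\right)/\lambda, & \lambda \neq 0,
\end{cases}
$$

which can only rescale variation that grows or shrinks monotonically with the level of the series; it offers no remedy for variation that peaks at mid-levels, as it does here.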
lambda <- BoxCox.lambda(cangas)
dframe <- cbind(RawSeries = cangas, TransformedData = BoxCox(cangas,lambda))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("billion cubic metres") +
ggtitle("Monthly Canadian gas production, \n power parameter lambda = ", round(lambda,5))
retaildata <- readxl::read_excel('retail.xlsx', skip=1)
myts <- ts(retaildata[,"A3349824F"], frequency=12, start=c(1982,4))
lambda <- BoxCox.lambda(myts)
dframe <- cbind(RawSeries = myts, TransformedData = BoxCox(myts,lambda))
autoplot(dframe, facet=TRUE) +
xlab("Years") + ylab("Sales") +
ggtitle("Australian Retail Data (ID: A3349824F), \n power parameter lambda = ", round(lambda,5))
myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)
autoplot(myts) +
autolayer(myts.train, series="Training") +
autolayer(myts.test, series="Test")
fc <- snaive(myts.train)
accuracy(fc,myts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 12.11051 25.24173 18.71471 6.299482 8.728705 1.000000
## Test set 66.10833 73.22137 66.10833 14.002564 14.002564 3.532425
## ACF1 Theil's U
## Training set 0.7639057 NA
## Test set 0.7434153 1.724674
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 977.76, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
Do the residuals appear to be uncorrelated and normally distributed?
No on both counts. The ACF shows positive (and decaying) autocorrelation over the first 11 lags, and the residual histogram is right-skewed. According to the Ljung-Box test, a large value of Q* (here, 977.76, with p < 2.2e-16) indicates that the autocorrelations do not come from a white noise series.
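As a cross-check (not part of the required output), the same Ljung-Box statistic can be reproduced directly with stats::Box.test; lag = 24 and fitdf = 0 match the "Total lags used" and "Model df" reported by checkresiduals() above.
# Reproduce the Ljung-Box test on the seasonal naive residuals
Box.test(residuals(fc), lag = 24, type = "Ljung-Box", fitdf = 0)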
In the first case, we held out the last three years of monthly sales data as the test set and evaluated the seasonal naive forecasts out-of-sample: the test RMSE was 73.22137 and the test MASE 3.532425. The corresponding in-sample (training) values are:
accuracy(fc,myts.test)[,c(2,6)][1,]  # training-set RMSE and MASE
## RMSE MASE
## 25.24173 1.00000
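The training MASE is exactly 1 by construction: MASE scales forecast errors by the in-sample mean absolute error of the one-step seasonal naive forecast (Hyndman & Koehler, 2006), which is the very method being evaluated here. With seasonal period $m = 12$ and $T$ training observations,

$$
\text{MASE} = \frac{\operatorname{mean}\left(|e_j|\right)}{\frac{1}{T-m}\sum_{t=m+1}^{T} |y_t - y_{t-m}|}.
$$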
head(myts) #Time Series begins (1982,4)
## Apr May Jun Jul Aug Sep
## 1982 50.4 57.9 54.4 56.2 55.0 55.6
tail(myts) #Time Series ends (2013,12)
## Jul Aug Sep Oct Nov Dec
## 2013 486.8 503.6 494.3 541.3 561.2 643.6
(dates <- length(myts))  # total number of monthly observations
## [1] 381
Below, we calculate the test-set mean error (ME) and root mean squared error (RMSE) for a range of training/test splits.
years <- 1983:2012
df <- NULL
for (i in years) {
myts.train <- window(myts, end=c(i,12))
myts.test <- window(myts, start=i+1)
fc <- snaive(myts.train)
acc <- accuracy(fc,myts.test)[,c(1,2)][2,]  # test-set ME and RMSE
df <- cbind(df, acc)
}
colnames(df) <- years
df2 <- data.frame(t(df))
pct_test <- round((2013-years) * 12 / dates,3)  # fraction of the 381 observations held out for testing
df2 <- cbind(df2, pct_test)
print(df2)
## ME RMSE pct_test
## 1983 14.8166667 16.037092 0.945
## 1984 10.5333333 11.979357 0.913
## 1985 11.7750000 14.136271 0.882
## 1986 11.4333333 12.404166 0.850
## 1987 10.0416667 13.133228 0.819
## 1988 10.1291667 10.752383 0.787
## 1989 1.1125000 7.367638 0.756
## 1990 15.3500000 24.018812 0.724
## 1991 44.3875000 48.218388 0.693
## 1992 32.8166667 35.544819 0.661
## 1993 14.0375000 19.275513 0.630
## 1994 13.8041667 21.997566 0.598
## 1995 24.3125000 30.123572 0.567
## 1996 31.8541667 35.545563 0.535
## 1997 11.4416667 18.728432 0.504
## 1998 21.5875000 28.649920 0.472
## 1999 9.5000000 16.030362 0.441
## 2000 -6.0875000 18.425695 0.409
## 2001 20.2500000 31.688628 0.378
## 2002 59.8458333 71.808545 0.346
## 2003 68.4000000 71.109452 0.315
## 2004 50.3625000 56.312073 0.283
## 2005 52.7333333 55.954528 0.252
## 2006 0.6583333 30.529644 0.220
## 2007 -30.6250000 35.064583 0.189
## 2008 2.5500000 18.527007 0.157
## 2009 8.8500000 32.502820 0.126
## 2010 66.1083333 73.221371 0.094
## 2011 59.4791667 64.331365 0.063
## 2012 28.2583333 33.297710 0.031
attach(df2)
## The following object is masked _by_ .GlobalEnv:
##
## pct_test
plot(x=pct_test, y=RMSE)
cor(pct_test, RMSE)
## [1] -0.6200111
We get the opposite of the result we would expect: a moderately strong negative correlation between the test-set share and RMSE. As the percentage of data held out for testing increases (leaving less data for training), the error should increase, but that is not the case here.
plot(x=pct_test, y=ME)
cor(pct_test, ME)
## [1] -0.2830965
Here again there is a negative correlation between the test-set share and the mean error (ME), though weaker than for RMSE. Again, we would expect a positive relationship.
The author believes this surprising finding reflects a peculiarity of this particular time series: the early values have low volatility and sit close to the overall mean, while volatility explodes as the series progresses, distorting the later forecasts. Note also that snaive() produces only h = 2 * frequency(x) = 24 forecasts by default, so accuracy() evaluates each split only over the two years immediately following the training window; the later windows fall in the high-volatility era.
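A quick diagnostic sketch supports this (not part of the assignment; note that 1982 and 2013 are partial years): grouping the series by calendar year and computing the within-year standard deviation should show volatility rising sharply in the later years.
# Within-year standard deviation of the retail series;
# floor(time(myts)) maps each monthly observation to its calendar year
yearly_sd <- tapply(as.numeric(myts), floor(time(myts)), sd)
plot(as.numeric(names(yearly_sd)), yearly_sd, type = "l",
     xlab = "Year", ylab = "Within-year standard deviation")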