CUNY DATA624 Homework 2
- Question 3.1
- Question 3.2
- Question 3.3
- Question 3.8 – For your retail time series (from Exercise 3 in Section 2.10):
- Split the data into two parts
- Check that your data have been split appropriately by producing the following plot.
- Calculate forecasts using snaive applied to myts.train.
- Compare the accuracy of your forecasts against the actual values stored in myts.test
- Check the residuals.
- How sensitive are the accuracy measures to the training/test split?
Question 3.1
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.
I’ll create a function to identify the appropriate Box-Cox Transformation for each dataset.
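The definition of that function is not reproduced in this write-up, so here is a minimal sketch of what such a helper could look like. It assumes the forecast and ggplot2 packages; the body of `BC()` is a hypothetical reconstruction, and `AirPassengers` is only a stand-in series so the example runs on its own.

```r
library(forecast)  # BoxCox.lambda(), BoxCox(), autoplot() method for ts objects
library(ggplot2)   # ggtitle()

# Hypothetical helper (the original BC() is not shown in the text):
# estimate lambda, then plot the series before and after the transformation.
BC <- function(series) {
  lambda <- BoxCox.lambda(series)        # Guerrero's method by default
  transformed <- BoxCox(series, lambda)
  print(autoplot(series) + ggtitle("Original series"))
  print(autoplot(transformed) +
          ggtitle(sprintf("Box-Cox transformed (lambda = %.3f)", lambda)))
  invisible(lambda)                      # return lambda without printing it
}

# Stand-in series; swap in the dataset of interest
lambda <- BC(AirPassengers)
```

A lambda near 1 means the transformation changes little, while a lambda near 0 is close to a log transformation.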
Question 3.2
Why is a Box-Cox transformation unhelpful for the cangas data?
Comparing the two plots, the transformation makes the series look more complicated rather than simpler. The original series is closer to linear, whereas the Box-Cox transformed series has a more pronounced curve, and the spike around 1973 is more prominent after the transformation than in the original. A Box-Cox transformation is only worthwhile if it makes the variation more even across the series, and that does not appear to happen here.
Question 3.3
What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?
For Exercise 3 in Section 2.10, I selected the “Turnover ; New South Wales ; Food retailing” data.
The original plot looks fairly simple, but after applying a Box-Cox power transformation, the variation in the later years (about 2005 onward) is visibly reduced.
retaildata <- readxl::read_excel("retail.xlsx", skip = 1)
myts <- ts(retaildata[, "A3349398A"],
           frequency = 12, start = c(1982, 4))
BC(myts)
Question 3.8 – For your retail time series (from Exercise 3 in Section 2.10):
Split the data into two parts
Check that your data have been split appropriately by producing the following plot.
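The code and plot for these two steps are not reproduced here, so the following is a sketch of one way to do them, assuming the forecast and ggplot2 packages. The split point is hypothetical and `AirPassengers` is a stand-in for the retail series, so the dates will differ from the ones behind the results in this report.

```r
library(forecast)  # autoplot() and autolayer() methods for ts objects
library(ggplot2)

# Stand-in monthly series so the sketch runs on its own;
# replace with the retail series myts built earlier.
myts <- AirPassengers

# Hypothetical split point: hold out the final two years as the test set
myts.train <- window(myts, end = c(1958, 12))
myts.test  <- window(myts, start = c(1959, 1))

# Overlay both parts on the full series; they should tile it
# with no gap and no overlap
autoplot(myts) +
  autolayer(myts.train, series = "Training") +
  autolayer(myts.test, series = "Test")

# Numeric check that no observation was lost or duplicated
stopifnot(length(myts.train) + length(myts.test) == length(myts))
```

The plot is the visual check asked for in the exercise; the length comparison is an extra safeguard that is easy to automate.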
Calculate forecasts using snaive applied to myts.train.
Compare the accuracy of your forecasts against the actual values stored in myts.test
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 73.94114 88.31208 75.13514 6.068915 6.134838 1.000000 0.6312891
## Test set 115.00000 127.92727 115.00000 4.459712 4.459712 1.530576 0.2653013
## Theil's U
## Training set NA
## Test set 0.7267171
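In the table above, the training-set MASE is exactly 1 because MASE scales the errors by the in-sample MAE of the seasonal naive method, which is the model fitted here. The test-set MASE of about 1.53 therefore says the out-of-sample errors are roughly 53% larger than the in-sample ones. A sketch of how such a table can be produced follows; it assumes the forecast package, and the stand-in series and split point are hypothetical.

```r
library(forecast)  # snaive(), accuracy()

# Stand-in series and hypothetical split; replace with the retail data
myts <- AirPassengers
myts.train <- window(myts, end = c(1958, 12))
myts.test  <- window(myts, start = c(1959, 1))

# Seasonal naive: each forecast repeats the last observed value
# from the same month of the training data
fc <- snaive(myts.train, h = length(myts.test))

# One row of error measures for the training set, one for the test set
acc <- accuracy(fc, myts.test)
print(acc)
```

Setting `h` to the length of the test set keeps the forecasts and the held-out values aligned, so every test observation is scored.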
Check the residuals.
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 671.41, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
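The Ljung-Box p-value above is far below 0.05, so the residuals are clearly autocorrelated and the seasonal naive method leaves structure in the data unexplained. Output like this can be obtained roughly as sketched below, assuming the forecast package; the stand-in series and split point are hypothetical, and the explicit `Box.test()` call is my addition for reading the p-value programmatically.

```r
library(forecast)  # snaive(), checkresiduals()

# Stand-in series and hypothetical split; replace with the retail data
myts <- AirPassengers
myts.train <- window(myts, end = c(1958, 12))
fc <- snaive(myts.train)

# Time plot, ACF and histogram of the residuals, plus a Ljung-Box test
checkresiduals(fc)

# The same kind of test run directly with base R, using 24 lags
# (two seasonal periods) to match the output above
lb <- Box.test(residuals(fc), lag = 24, type = "Ljung-Box")
lb$p.value
```

A small p-value here rejects the hypothesis that the residuals are white noise, which suggests a better model could improve on these forecasts.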
How sensitive are the accuracy measures to the training/test split?
Accuracy measures can be quite sensitive to the training/test split, because a single split evaluates the forecasts on only one stretch of the data. Time series cross-validation gives a more robust picture by averaging the errors over many forecast origins, and the tsCV function does this. In the plot below, the cross-validated MSE grows as the forecast horizon h increases from 1 to 8.
# Cross-validated forecast errors; column j holds the errors for horizon h = j
e <- tsCV(myts, forecastfunction = snaive, h = 8)
mse <- colMeans(e^2, na.rm = TRUE)
data.frame(h = 1:8, MSE = mse) %>%
  ggplot(aes(x = h, y = MSE)) + geom_point()
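A more direct way to probe the question is to move the split point and recompute the test-set measures each time. The sketch below does that under stated assumptions: it uses the forecast package, `split_accuracy()` is a hypothetical helper, and `AirPassengers` stands in for the retail series.

```r
library(forecast)  # snaive(), accuracy(), subset() method for ts objects

# Stand-in series; replace with the retail data
myts <- AirPassengers

# Hypothetical helper: test-set accuracy of snaive when the last
# `test_years` years are held out
split_accuracy <- function(series, test_years) {
  n <- length(series)
  n_test <- test_years * frequency(series)
  train <- subset(series, end = n - n_test)
  test  <- subset(series, start = n - n_test + 1)
  fc <- snaive(train, h = n_test)
  accuracy(fc, test)["Test set", c("RMSE", "MAE", "MAPE", "MASE")]
}

# One row per split: hold out the last 1, 2, 3 and 4 years in turn
results <- t(sapply(1:4, function(k) split_accuracy(myts, k)))
rownames(results) <- paste(1:4, "test years")
print(round(results, 2))
```

If the rows differ a lot, conclusions drawn from any single split should be treated with caution, which is the motivation for cross-validation.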