3.1 For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

  • usnetelec: lambda = 0.5167714
  • usgdp: lambda = 0.366352
  • mcopper: lambda = 0.1919047
  • enplanements: lambda = -0.2269461
  • Below I plot each transformed series on the left and the original series on the right
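For reference, the transformation these lambdas parameterise is the Box-Cox transformation as defined in fpp2:

$$
w_t = \begin{cases} \ln(y_t) & \text{if } \lambda = 0, \\ (y_t^{\lambda} - 1)/\lambda & \text{otherwise.} \end{cases}
$$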

library(fpp2)     # data sets, BoxCox(), snaive(), ets(), autoplot()
library(cowplot)  # plot_grid()
library(knitr)    # kable(), used later

# Select a lambda for each series (default Guerrero method)
lambda1 <- BoxCox.lambda(usnetelec)
lambda2 <- BoxCox.lambda(usgdp)
lambda3 <- BoxCox.lambda(mcopper)
lambda4 <- BoxCox.lambda(enplanements)
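As a sanity check on BoxCox() (a minimal sketch), applying the formula above directly should reproduce the transformed series for a non-zero lambda:

# Manual Box-Cox for lambda != 0; should match BoxCox() on this positive series
manual <- (usnetelec^lambda1 - 1) / lambda1
all.equal(as.numeric(BoxCox(usnetelec, lambda1)), as.numeric(manual))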

plot_grid(autoplot(BoxCox(usnetelec, lambda1)), autoplot(usnetelec),
          autoplot(BoxCox(usgdp, lambda2)), autoplot(usgdp),
          labels = c('usnetelec lambda trans', 'usnetelec',
                     'US GDP lambda trans', 'US GDP'))

plot_grid(autoplot(BoxCox(mcopper, lambda3)), autoplot(mcopper),
          autoplot(BoxCox(enplanements, lambda4)), autoplot(enplanements),
          labels = c('mcopper lambda trans', 'mcopper',
                     'enplanements lambda trans', 'enplanements'))

  • The first two transformations (usnetelec, US GDP) do a good job of reducing the variation in the series. The second two (mcopper, enplanements) are less convincing: the transformed series still show noticeable changes in variability over time.

3.2 Why is a Box-Cox transformation unhelpful for the cangas data?

  • The transformation does little to stabilise the series because the variance is not a simple function of the level of the data: the middle portion of the series (roughly the mid-1970s to about 1990) has abnormally large variation while both ends are comparatively calm, so no single lambda can fix it. The rough check below illustrates this.
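A rough numeric illustration (a sketch; the window boundaries are my own choice, not part of the exercise):

# Spread of cangas by era (crude: sd also picks up trend), showing the
# middle era has the largest variation
sd(window(cangas, end = c(1974, 12)))
sd(window(cangas, start = c(1975, 1), end = c(1990, 12)))
sd(window(cangas, start = c(1991, 1)))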
lambda <- BoxCox.lambda(cangas)


plot_grid(autoplot(BoxCox(cangas, lambda)), autoplot(cangas),
          labels = c('lambda trans cangas', 'original'))

3.3 What Box-Cox transformation would you select for your retail data from Exercise 3?

  • The series I chose was New South Wales ; Other recreational goods retailing (column 15 of the sheet)
    • lambda = -0.446
retaildata <- readxl::read_excel("retail.xlsx", skip = 1)
myts <- ts(retaildata[, 15], frequency = 12, start = c(1982, 4))
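Selecting the column by position is brittle if the spreadsheet layout ever changes. A sketch of selecting it by partial name instead (this assumes the header text matches the series named above, and that exactly one column matches):

# Hypothetical name-based selection; the pattern is assumed, not verified
idx <- grep("Other recreational goods", names(retaildata))
myts <- ts(retaildata[[idx]], frequency = 12, start = c(1982, 4))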
lambda <- BoxCox.lambda(myts)

plot_grid(autoplot(BoxCox(myts, lambda)), autoplot(myts),
          labels = c('lambda trans', 'original'))

3.8

A. Split data

myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)

B. Check the split

autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test")

C. Calculate forecasts with snaive()

fc <- snaive(myts.train)
fc
##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 2011           97.7  89.37191 106.02809  84.96329 110.43671
## Feb 2011           74.6  66.27191  82.92809  61.86329  87.33671
## Mar 2011           88.2  79.87191  96.52809  75.46329 100.93671
## Apr 2011           88.7  80.37191  97.02809  75.96329 101.43671
## May 2011           83.5  75.17191  91.82809  70.76329  96.23671
## Jun 2011           83.4  75.07191  91.72809  70.66329  96.13671
## Jul 2011           90.8  82.47191  99.12809  78.06329 103.53671
## Aug 2011           85.1  76.77191  93.42809  72.36329  97.83671
## Sep 2011           91.4  83.07191  99.72809  78.66329 104.13671
## Oct 2011          104.4  96.07191 112.72809  91.66329 117.13671
## Nov 2011          112.9 104.57191 121.22809 100.16329 125.63671
## Dec 2011          200.6 192.27191 208.92809 187.86329 213.33671
## Jan 2012           97.7  85.92231 109.47769  79.68757 115.71243
## Feb 2012           74.6  62.82231  86.37769  56.58757  92.61243
## Mar 2012           88.2  76.42231  99.97769  70.18757 106.21243
## Apr 2012           88.7  76.92231 100.47769  70.68757 106.71243
## May 2012           83.5  71.72231  95.27769  65.48757 101.51243
## Jun 2012           83.4  71.62231  95.17769  65.38757 101.41243
## Jul 2012           90.8  79.02231 102.57769  72.78757 108.81243
## Aug 2012           85.1  73.32231  96.87769  67.08757 103.11243
## Sep 2012           91.4  79.62231 103.17769  73.38757 109.41243
## Oct 2012          104.4  92.62231 116.17769  86.38757 122.41243
## Nov 2012          112.9 101.12231 124.67769  94.88757 130.91243
## Dec 2012          200.6 188.82231 212.37769 182.58757 218.61243
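Since snaive() simply repeats the most recent observed year, the Jan-Dec 2011 point forecasts above should equal the Jan-Dec 2010 observations. A quick check:

# Last training year; should match the 2011 (and 2012) point forecasts above
tail(myts.train, 12)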

D. Compare accuracy

accuracy(fc,myts.test)
##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 2.772372 6.498441 4.897297 4.654289 8.011653 1.000000
## Test set     4.358333 8.961771 7.266667 4.197187 6.714843 1.483812
##                   ACF1 Theil's U
## Training set 0.7123920        NA
## Test set     0.1026843 0.3872304
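The training-set MASE is 1.00 by construction: MASE scales each MAE by the in-sample seasonal naive MAE. The test-set value can be reproduced by hand (a sketch):

# Test MAE divided by the in-sample seasonal naive MAE (the MASE scaling term).
# fc$mean covers 2011-2012, so trim the test window to match before subtracting.
err <- window(myts.test, end = c(2012, 12)) - fc$mean
mean(abs(err)) / mean(abs(diff(myts.train, lag = 12)))  # ~1.48, as in the table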

E. Check residuals

  • The residuals appear heteroscedastic; the violation of constant variance becomes pronounced after about 2005
  • The ACF plot and Ljung-Box test show that the residuals are clearly still autocorrelated
checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 553.11, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

F. Explore how accuracy depends on the train/test split

  • To explore this sensitivity I re-split the data three ways, with the test set starting three years out (2011), two years out (2012), and one year out (2013)

Three years out

  • I show the code for the three-years-out split and omit it for the other two, since it is the same code with different window dates
## 2011 forward
myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)
fc <- snaive(myts.train)
checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 553.11, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24
# Note: the forecasts compared below come from ets(), not snaive()
myts.train %>%
  ets() %>%
  forecast(h = length(myts.test), PI = FALSE) -> fc
three_years_out <- autoplot(fc) + autolayer(myts.test)
twenty_11_forward <- round(accuracy(fc,myts.test),2)
three_years_out

Two years out

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 568.78, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

One year out

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 549.26, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
## 2013 116.0  90.6  96.3 112.9 100.8  93.1 101.9 109.1 107.6 122.7 154.8
##        Dec
## 2013 237.1

Compare accuracy scores

# test results
accuracy_table <- as.data.frame(rbind(twenty_11_forward[2, 1:8],
                                      twenty_12_forward[2, 1:8],
                                      twenty_13_forward[2, 1:8]))
rownames(accuracy_table) <- c("three_years_out", "two_years_out", "one_year_out")
kable(accuracy_table, caption = "Comparison of train/test split accuracy")
Comparison of train/test split accuracy

                     ME   RMSE    MAE    MPE   MAPE  MASE  ACF1  Theil's U
three_years_out    1.99  11.81   8.83   0.46   7.75  1.80  0.45       0.48
two_years_out     -5.94  12.97  10.11  -6.55   9.55  2.03  0.38       0.67
one_year_out      17.17  20.87  17.17  13.26  13.26  3.38  0.37       0.81
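The three splits above repeat the same chunk with different dates; a more compact alternative (a sketch, using the same ets() forecasts) loops over the test-set start years:

for (yr in 2011:2013) {
  tr <- window(myts, end = c(yr - 1, 12))
  te <- window(myts, start = yr)
  fc_yr <- forecast(ets(tr), h = length(te), PI = FALSE)
  print(round(accuracy(fc_yr, te)[2, 1:8], 2))  # test-set row only
}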

One last train/test split

  • The variation in the series clearly changes over time, so I wondered whether training on only the most recent few years would improve accuracy
  • This model avoids the autocorrelation issues (Ljung-Box p-value of 0.11 below; the odd non-integer lag count of 7.4 apparently comes from capping the lags at T/5 for the short 37-observation training window), but its accuracy is still largely in line with the other train/test splits
# Train on recent data only. Note end = 2013 means Jan 2013, so the training
# and test windows overlap by one observation (Jan 2013).
myts.train <- window(myts, start = c(2010, 1), end = 2013)
myts.test <- window(myts, start = 2013)
fc <- snaive(myts.train)
checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 12.239, df = 7.4, p-value = 0.1107
## 
## Model df: 0.   Total lags used: 7.4
myts.train %>%
  ets() %>%
  forecast(h = length(myts.test), PI = FALSE) -> fc
most_recent_plot <- autoplot(fc) + autolayer(myts.test)  # keep the plot separate
most_recent_data_only <- round(accuracy(fc, myts.test), 2)[2, 1:8]
twenty_13_forward  # one-year-out accuracy again, for comparison
##                 ME  RMSE   MAE   MPE  MAPE MASE ACF1 Theil's U
## Training set  0.08  4.23  3.01 -0.26  4.63 0.59 0.13        NA
## Test set     17.17 20.87 17.17 13.26 13.26 3.38 0.37      0.81
accuracy_table <- rbind(accuracy_table,most_recent_data_only)
rownames(accuracy_table)[4] <-"most_recent_data_only" 
kable(accuracy_table,caption= 'Comparison of train/test split accuracy') 
Comparison of train/test split accuracy

                          ME   RMSE    MAE    MPE   MAPE  MASE  ACF1  Theil's U
three_years_out         1.99  11.81   8.83   0.46   7.75  1.80  0.45       0.48
two_years_out          -5.94  12.97  10.11  -6.55   9.55  2.03  0.38       0.67
one_year_out           17.17  20.87  17.17  13.26  13.26  3.38  0.37       0.81
most_recent_data_only  17.46  20.31  17.46  13.73  13.73  2.28  0.34       0.84

Conclusion

  • Overall, the accuracy scores depend heavily on exactly which observations land in the training and test sets. This is why time series cross-validation, which averages forecast errors over many rolling origins rather than relying on a single split, is needed to properly assess a forecasting method.
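As a pointer in that direction, the forecast package provides tsCV() for rolling-origin cross-validation. A minimal sketch, scoring one-step seasonal naive forecasts from every origin:

# e holds the one-step forecast error made from each rolling origin
e <- tsCV(myts, snaive, h = 1)
sqrt(mean(e^2, na.rm = TRUE))  # cross-validated RMSE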