624 Predictive Analysis - Time Series

HA 3.1

For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.

* usnetelec
* usgdp
* mcopper
* enplanements

library(fpp2)       # forecast functions and the exercise data sets
library(gridExtra)

# Compute the Box-Cox lambda for a series, report it, and show the
# series before and after the transformation
transform.boxcox <- function(ts, objname) {
  lambda <- BoxCox.lambda(ts)
  print(paste('Lambda value for time series', objname, '=', round(lambda, digits = 3)))

  before <- autoplot(ts) +
    labs(title = paste('Original (Before Transformation)', objname))
  after <- autoplot(BoxCox(ts, lambda)) +
    xlab('Time') + ylab('Value') +
    labs(title = paste('Lambda for', objname, '=', round(lambda, digits = 2)),
         subtitle = 'Original vs Transformed plot')

  grid.arrange(before, after)  # par(mfrow) has no effect on grid graphics
}
transform.boxcox(usnetelec, "Annual US Electricity")
## [1] "Lambda value for time series Annual US Electricity = 0.517"

transform.boxcox(usgdp, 'US GDP Quarterly')
## [1] "Lambda value for time series US GDP Quarterly = 0.366"

transform.boxcox(mcopper, 'Monthly Copper Price')
## [1] "Lambda value for time series Monthly Copper Price = 0.192"

transform.boxcox(enplanements, 'US Enplanements Monthly')
## [1] "Lambda value for time series US Enplanements Monthly = -0.227"

All four data sets have now been Box-Cox transformed and their corresponding lambdas calculated. The lambdas decrease across the series: 0.52 for US electricity, 0.37 for US GDP, 0.19 for copper, and -0.23 for US enplanements. Judging from the lambda values alone, the need for transformation grows as lambda moves further from 1, i.e. as it decreases here.
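For reference, these lambdas parameterize the standard Box-Cox transformation (the form used by the forecast package), with the natural log as the limiting case at $\lambda = 0$:

$$
w_t =
\begin{cases}
\log(y_t) & \text{if } \lambda = 0,\\[4pt]
\dfrac{y_t^{\lambda} - 1}{\lambda} & \text{otherwise,}
\end{cases}
$$

so lower lambdas compress large values more strongly.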

We plotted the before-and-after transformation time series plots. As we can see, the transformation does not help much with the US electricity data, since there was not much variation to begin with; the same holds for the US GDP data.

The transformation helped stabilize the variation in the copper data, and it was especially effective in stabilizing the variation in the US enplanements data.

3.2

Why is a Box-Cox transformation unhelpful for the cangas data?

transform.boxcox(cangas, 'Canadian Gas Production by Month')
## [1] "Lambda value for time series Canadian Gas Production by Month = 0.577"

The Canadian gas production data has a lambda of 0.58, suggesting only a mild Box-Cox effect.

By plotting the raw data, we can see that the variation is not even throughout the years. The series can be divided into three parts: 1960 to 1975, with relatively low variation; 1975 to 1990, with significant variation each year; and 1990 to the end of the series, where the variation decreases again compared to the middle part. In such a case the Box-Cox transformation is not helpful.

Mathematical transformations are helpful “if the data show variation that increases or decreases with the level of the series.” A Box-Cox transformation therefore works when the size of the variation changes systematically with the level of the series, so that a single lambda can make the variation uniform over time. In cangas, the seasonal variation (steady within each year) first grows over the early years, then jumps in the middle 10-15 years, and finally shrinks again toward the end. Because the variation does not rise or fall monotonically with the level of the series, no single power transformation can even it out.

Therefore, the Box-Cox transformation is not helpful for this data set.
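A quick way to see this is to plot cangas under a few candidate lambdas; this sketch (not part of the original assignment) shows that the mid-series burst of variation survives every choice:

# Compare cangas under several Box-Cox lambdas; no lambda evens out
# the burst of variation in the middle of the series
autoplot(cbind(original = cangas,
               log      = BoxCox(cangas, 0),
               sqrt     = BoxCox(cangas, 0.5)),
         facets = TRUE) +
  ggtitle('cangas under different Box-Cox lambdas')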

3.3

What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

retaildata <- readxl::read_excel("retail.xlsx", skip=1)

myts <- ts(retaildata[,"A3349627V"],
  frequency=12, start=c(1982,4))

transform.boxcox(myts, 'South Wales Store Retail Monthly')
## [1] "Lambda value for time series South Wales Store Retail Monthly = -0.058"

We can see from the before-and-after plots above that a Box-Cox transformation with lambda = -0.06 is helpful in stabilizing the seasonal variance of the retail series. This data set follows the pattern of variation that increases over time, which warrants a mathematical transformation.

After the transformation, the peak variation of the later years is much diminished and less pronounced in the transformed plot. The transformation has made the seasonal variation more uniform over the years and stabilized the data for further analysis.
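As a follow-up sketch (not part of the exercise), the chosen lambda can be passed directly to a forecasting function from the forecast package, which fits on the transformed scale and back-transforms the forecasts automatically:

# Model on the Box-Cox scale, forecast on the original scale;
# biasadj = TRUE returns mean forecasts rather than medians
lambda.myts <- BoxCox.lambda(myts)
fc.bc <- snaive(myts, h = 24, lambda = lambda.myts, biasadj = TRUE)
autoplot(fc.bc)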

3.8

For your retail time series (from Exercise 3 in Section 2.10):

3.8a.

Split the data into two parts using:

myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)

3.8b.

Check that your data have been split appropriately by producing the following plot.

autoplot(myts) +
   autolayer(myts.train, series="Training Data") +
   autolayer(myts.test, series="Testing Data")

print('Last Few Months of Training Data')
## [1] "Last Few Months of Training Data"
tail (myts.train)
##        Jul   Aug   Sep   Oct   Nov   Dec
## 2010 206.7 212.9 224.3 238.7 251.2 380.0
print('First Few Months of Testing Data')
## [1] "First Few Months of Testing Data"
head(myts.test)
##        Jan   Feb   Mar   Apr   May   Jun
## 2011 250.0 216.8 240.5 244.4 232.1 223.6

3.8c.

Calculate forecasts using snaive applied to myts.train.

fc <- snaive(myts.train)  # seasonal naive: each forecast equals the value from the same month one year earlier

3.8d.

Compare the accuracy of your forecasts against the actual values stored in myts.test.

accuracy(fc,myts.test)
##                     ME     RMSE       MAE       MPE      MAPE    MASE      ACF1
## Training set  6.870871 12.27525  8.893093  5.476112  7.780981 1.00000 0.6617306
## Test set     28.400000 29.39091 28.400000 11.015822 11.015822 3.19349 0.5697915
##              Theil's U
## Training set        NA
## Test set     0.7493485
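One caveat worth noting: snaive defaults to a horizon of two seasonal periods, i.e. 24 months here, so if the test window is longer than that, accuracy() only compares the overlapping months. A sketch (not the assignment's output) that covers the whole test set by setting the horizon explicitly:

# Forecast exactly as many steps as the test set contains, so
# accuracy() evaluates over the full test window
fc.full <- snaive(myts.train, h = length(myts.test))
accuracy(fc.full, myts.test)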

3.8e.

Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?

checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 591.71, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

From the ACF plot above, we see that the residuals are highly correlated: most of the spikes fall outside the blue significance bounds, which means the residuals are not white noise. The shorter lags in particular show strong correlation, around 0.6.

The histogram shows that the residuals are not normally distributed: the distribution is not centered at 0, and it is skewed to the right, with a longer right tail.

The Ljung-Box test has a highly significant p-value (< 0.00001), which means the seasonal naive model leaves substantial structure in the residuals. The data is not ready for further analysis: there is still too much autocorrelation in the residuals, which would bias any subsequent analysis.
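For reference, the same Ljung-Box statistic can be reproduced directly with base R (a sketch; checkresiduals() already ran this test for us):

# Ljung-Box test by hand: 24 lags, fitdf = 0 since snaive fits no parameters
Box.test(residuals(fc), lag = 24, type = "Ljung-Box", fitdf = 0)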

3.8f.

How sensitive are the accuracy measures to the training/test split?

# Time-series cross-validation: seasonal naive forecasts, 1 to 10 steps ahead
e <- tsCV(myts, forecastfunction = snaive, h = 10)

# Per-horizon MSE and RMSE from the cross-validation errors
mse  <- colMeans(e^2, na.rm = TRUE)
rmse <- sqrt(mse)

rmse.plot <- data.frame(h = 1:10, RMSE = rmse) %>%
  ggplot(aes(x = h, y = RMSE)) +
  geom_point() + ggtitle('Root Mean Square Error')

mse.plot <- data.frame(h = 1:10, MSE = mse) %>%
  ggplot(aes(x = h, y = MSE)) +
  geom_point() + ggtitle('Mean Square Error')

gridExtra::grid.arrange(rmse.plot, mse.plot, ncol = 2)

We computed the cross-validated mean square error and the corresponding root mean square error for each of the 10 forecast horizons and plotted them side by side.

The cross-validated MSE rises from about 150 one step ahead to about 170 ten steps ahead, so the corresponding RMSE grows from roughly 12 to 13 over the same horizons; both plots show the same gradual increase.

The cross-validated RMSE (roughly 12 to 13) sits close to the training-set RMSE (12.3) but far below the test-set RMSE (29.4) from the single split in 3.8d. This large gap indicates that the accuracy measures for this time series are very sensitive to the training/test split.
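To probe this sensitivity directly, one could repeat the train/test comparison for several split points (a sketch with hypothetical split years, not part of the original assignment) and watch the test-set RMSE move:

# Hypothetical experiment: vary the split year and record the test-set RMSE
split.years <- 2007:2011
sapply(split.years, function(yr) {
  train <- window(myts, end = c(yr, 12))
  test  <- window(myts, start = yr + 1)
  fc.yr <- snaive(train, h = length(test))
  accuracy(fc.yr, test)['Test set', 'RMSE']
})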