Do exercises 3.1, 3.2, 3.3 and 3.8 from the online Hyndman book. Please include your Rpubs link along with your .rmd file.

library(fpp2)
library(ggplot2)
library(knitr)
library(kableExtra)
  1. For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

Exploring usnetelec

Annual US net electricity generation (billion kwh) for 1949-2003

#help(usnetelec)
head(usnetelec)
## Time Series:
## Start = 1949 
## End = 1954 
## Frequency = 1 
## [1] 296.1 334.1 375.3 403.8 447.0 476.3
summary(usnetelec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   296.1   889.0  2040.9  1972.1  3002.7  3858.5

Plotting the time series, we can see that the observations are annual (frequency = 1) and the series trends steadily upward.

autoplot(usnetelec)

lambda <- BoxCox.lambda(usnetelec)
print(paste0("Lambda for usnetelec: ", lambda))
## [1] "Lambda for usnetelec: 0.516771443964645"
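
For completeness, the transformed series can be plotted the same way as for the other series below. This is a small sketch using the lambda estimated above.

# Plot the Box-Cox transformed usnetelec series using the estimated lambda
autoplot(BoxCox(usnetelec, lambda)) +
  xlab("Year") + ylab("Transformed net generation") +
  ggtitle("usnetelec: Box-Cox transformed")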

Exploring usgdp

Quarterly US GDP. 1947:1 - 2006.1.

#help("usgdp")
head(usgdp)
##        Qtr1   Qtr2   Qtr3   Qtr4
## 1947 1570.5 1568.7 1568.0 1590.9
## 1948 1616.1 1644.6
summary(usgdp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1568    2632    4552    5168    7130   11404
lambda <- BoxCox.lambda(usgdp)
print(paste0("Lambda for usgdp: ", lambda))
## [1] "Lambda for usgdp: 0.366352049520934"
autoplot(BoxCox(usgdp,lambda))
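
To make the effect of the transformation easier to see, the original and transformed series can be placed in stacked panels. This is just a sketch; the gdp_df name is illustrative.

# Compare original usgdp with its Box-Cox transform in facetted panels
gdp_df <- cbind(Original = usgdp,
                Transformed = BoxCox(usgdp, lambda))
autoplot(gdp_df, facets = TRUE) +
  xlab("Year") + ggtitle("Quarterly US GDP: original vs Box-Cox transformed")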

Exploring mcopper

Monthly copper prices. Copper, grade A, electrolytic wire bars/cathodes,LME,cash (pounds/ton)
Source: UNCTAD (http://stats.unctad.org/Handbook).

#help("mcopper")
head(mcopper)
##        Jan   Feb   Mar   Apr   May   Jun
## 1960 255.2 259.7 249.3 258.0 244.3 246.8
summary(mcopper)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   216.6   566.0   949.2   997.8  1262.5  4306.0
lambda <- BoxCox.lambda(mcopper)
print(paste0("Lambda for mcopper: ", lambda))
## [1] "Lambda for mcopper: 0.191904709003829"
autoplot(BoxCox(mcopper,lambda))

daily_average <- mcopper/monthdays(mcopper)
copper_df <- cbind(Monthly = mcopper,
                   daily_average = daily_average)
autoplot(copper_df, facets = TRUE) +
  xlab("Years") + ylab("Copper Price") +
  ggtitle("Monthly Copper Price")

The Box-Cox transformation lambda for mcopper is approximately 0.2.

Exploring enplanements

Domestic Revenue Enplanements (millions): 1996-2000. SOURCE: Department of Transportation, Bureau of Transportation Statistics, Air Carrier Traffic Statistic Monthly.

#help("enplanements")
head(enplanements)
##        Jan   Feb   Mar   Apr   May   Jun
## 1979 21.12 22.92 25.90 24.38 23.41 26.82
summary(enplanements)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.14   27.18   34.88   35.67   42.78   56.14
lambda <- BoxCox.lambda(enplanements)
print(paste0("Lambda for enplanements: ", lambda))
## [1] "Lambda for enplanements: -0.226946111237065"
autoplot(BoxCox(enplanements,lambda))

daily_average <- enplanements/monthdays(enplanements)
enp_df <- cbind(Monthly = enplanements,
                daily_average = daily_average)
autoplot(enp_df, facets = TRUE) +
  xlab("Years") + ylab("Domestic Revenue Enplanements (millions)") +
  ggtitle("Monthly US domestic enplanements")

ggseasonplot(enplanements)

gglagplot(enplanements)

ggAcf(enplanements)

The slowly decaying, consistently positive ACF indicates an upward trend, and the scalloped shape of the ACF points to a seasonal component as well.
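
As an additional check on the seasonal pattern (a sketch, not part of the original analysis), a subseries plot breaks the series out by month:

# Subseries plot: one panel per month, useful for confirming seasonality
ggsubseriesplot(enplanements) +
  ylab("Enplanements (millions)") + ggtitle("Seasonal subseries plot: enplanements")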

  2. Why is a Box-Cox transformation unhelpful for the cangas data?

Monthly Canadian gas production, billions of cubic metres, January 1960 - February 2005

#help("cangas")
head(cangas)
##         Jan    Feb    Mar    Apr    May    Jun
## 1960 1.4306 1.3059 1.4022 1.1699 1.1161 1.0113
summary(cangas)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.966   6.453   8.831   9.777  14.429  19.528
autoplot(cangas)

lambda <- BoxCox.lambda(cangas)
print(paste0("Lambda for cangas: ", lambda))
## [1] "Lambda for cangas: 0.576775938228139"
autoplot(BoxCox(cangas,lambda))

Based on the plots, a Box-Cox transformation is unhelpful for cangas because the variation in the series does not increase or decrease consistently with its level: the seasonal fluctuations are small in the early years, largest in the middle of the series (roughly the mid-1970s to around 1990), and smaller again towards the end. A power transformation can only rescale variation monotonically with the level of the series, so even though BoxCox.lambda() suggests a lambda of about 0.58, the transformed series is no more stable and does not yield a simpler model.
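
One way to see this (a quick sketch, not from the original assignment; the yearly_sd name is illustrative) is to plot the within-year standard deviation, which rises and then falls rather than tracking the level of the series:

# Keep complete calendar years only (cangas ends in February 2005)
cangas_full <- window(cangas, end = c(2004, 12))
# Standard deviation within each calendar year; a single Box-Cox lambda
# cannot remove variability that rises and then falls like this
yearly_sd <- aggregate(cangas_full, nfrequency = 1, FUN = sd)
autoplot(yearly_sd) + xlab("Year") + ylab("Within-year standard deviation")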

  3. What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?
retaildata <- readxl::read_excel("C:\\NITEEN\\GitHub\\CUNY_DATA_624\\Dataset\\retail.xlsx", skip=1)
#kable(head(retaildata))
myts <- ts(retaildata$A3349661X, start = c(1982,4),frequency = 12)
head(myts)
##       Apr  May  Jun  Jul  Aug  Sep
## 1982 19.2 21.9 19.9 19.3 19.6 19.9
autoplot(myts)

lambda <- BoxCox.lambda(myts)
print(paste0("Lambda for myts: ", lambda))
## [1] "Lambda for myts: -0.0493897710767389"
autoplot(BoxCox(myts,lambda))

Since the estimated lambda is close to zero (about -0.05), a log transformation appears appropriate.
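
As a quick sanity check (a sketch using the myts and lambda objects defined above; retail_compare is just an illustrative name), the series transformed with the estimated lambda and with an exact log (lambda = 0) can be compared side by side. Because lambda is so close to zero, the two panels should look nearly identical.

# Compare the estimated Box-Cox transform with a plain log transform
retail_compare <- cbind(Estimated = BoxCox(myts, lambda),
                        Log = BoxCox(myts, 0))
autoplot(retail_compare, facets = TRUE) +
  xlab("Year") + ggtitle("Retail series: estimated lambda vs log transform")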

  4. For your retail time series (from Exercise 3 in Section 2.10):

Split the data into two parts using

myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myts) + 
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test")

Calculate forecasts using snaive applied to myts.train.

fc <- snaive(myts.train)

Compare the accuracy of your forecasts against the actual values stored in myts.test

accuracy(fc,myts.test)
##                     ME     RMSE       MAE       MPE     MAPE     MASE
## Training set  3.518619 10.42741  7.244144  5.375707 11.81551 1.000000
## Test set     30.512500 33.10478 30.512500 20.388974 20.38897 4.212023
##                   ACF1 Theil's U
## Training set 0.7935927        NA
## Test set     0.3613754   2.66608

When choosing models, it is common practice to separate the available data into two portions, training and test data, where the training data is used to estimate any parameters of a forecasting method and the test data is used to evaluate its accuracy. Because the test data is not used in determining the forecasts, it should provide a reliable indication of how well the model is likely to forecast on new data.

Check the residuals.

checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 1059, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

Do the residuals appear to be uncorrelated and normally distributed?

The residuals appear to be correlated with one another (the Ljung-Box test strongly rejects white noise), and they do not appear to be normally distributed.
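
As a further check (a sketch; checkresiduals() already shows similar diagnostics, and the res name is illustrative), the residual distribution can be examined directly:

# Histogram of the snaive residuals plus a formal normality test
res <- residuals(fc)
gghistogram(res, add.normal = TRUE) + ggtitle("Histogram of snaive residuals")
shapiro.test(na.omit(as.numeric(res)))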

How sensitive are the accuracy measures to the training/test split?

Per Section 3.4, accuracy measures are always somewhat sensitive to the training/test split. The following points should be noted.

We can also use time series cross-validation to check the accuracy of a forecasting method. In this procedure, there is a series of test sets, each consisting of a single observation, and the corresponding training set consists only of observations that occurred prior to the observation forming the test set, so no future observations are used in constructing the forecast. Time series cross-validation is implemented with the tsCV() function.

Also, it looks like the Mean Error is highly sensitive by definition; RMSE, MAE, MPE, and MASE are also sensitive, whereas MAPE and ACF1 are less so.
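
As a sketch of the tsCV() procedure described above, the seasonal naive method can be cross-validated at a one-step horizon; comparing the result with the residual RMSE is an assumed extra check, not part of the original assignment.

# Cross-validated one-step errors for the seasonal naive method
e <- tsCV(myts, snaive, h = 1)
sqrt(mean(e^2, na.rm = TRUE))                        # tsCV-based RMSE
sqrt(mean(residuals(snaive(myts))^2, na.rm = TRUE))  # residual RMSE for comparison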