Libraries Used

library(ggplot2)
library(tseries)
library(forecast)

Data Introduction

The dataset Demand.txt includes 24 years (Jan. 1992 - Dec. 2015) of total monthly number of customers having business with a financial firm. Interest lies in performing a complete exploratory and model fitting analysis.

We will not include the last two years (Dec.2013-Dec.2015) in the exploratory process or evaluation of the models. These last two years will be used as a validation set against our models.

Problem 1a.) Check for non-constant variance

We first will visualize the time series.

#read in data
data <- read.csv('Demand.txt', header=FALSE)
dates = seq(as.Date("1992/1/1"), as.Date("2015/12/1"), by='month')
data$DATES = dates
data$Months = months(data$DATES)

ggplot(data[1:264,], aes(DATES, V1)) + geom_line() + scale_x_date('Time') + 
  ylab('Monthly # of Customers') + xlab('') + ggtitle('Original TS (Except Last Two Years)')
Figure 1

Figure 1

myts <- ts(data$V1[1:264], start=c(1992, 1),frequency=12)
all.ts <- ts(data$V1, start=c(1992, 1),frequency=12)

We will comment on any general or seasonal trends in part (b.).

In this part (a.) we will discuss if the variance is constant or not.

For this time series of monthly # of customers for a financial firm we see in Figure 1 that the variance is almost constant. In other words, the fluctuations between minimum and maximum values are not changing over time. Thus, it is unlikely we will need to apply any variance-stabilization techniques. Just for completeness we will check if R recommmends a variance-stabilization technique by the BoxCox.lambda() function. If this function returns a value close to 1 then that suggests no transformation is needed to stabilize the variance.

BoxCox.lambda(myts)
## [1] 0.8943227

Therefore, it is reasonable to believe that no variance-stabilization technique is needed.

We will now discuss the stationarity of the original series by analyzing the sample ACF plot and utilizing some statistical tests.

par(mfrow=c(1,2))
Acf(myts, main='ACF of Orig.TS')
Pacf(myts, main='PACF of Orig.TS')
Figure 2

Figure 2

The sample ACF plot is converging slowly to the standard normal bounds. This indicates the original time series is not stationary.

There are two statistical tests we can use to check if the series if stationary: ADF and KPSS tests.

adf.test(myts, alternative='stationary') #null hypothesis not rejected
## 
##  Augmented Dickey-Fuller Test
## 
## data:  myts
## Dickey-Fuller = -1.6895, Lag order = 6, p-value = 0.7067
## alternative hypothesis: stationary
kpss.test(myts, null='Level') #null hypothesis rejected. 
## Warning in kpss.test(myts, null = "Level"): p-value smaller than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  myts
## KPSS Level = 2.8897, Truncation lag parameter = 5, p-value = 0.01

As we can see from the console output, both the KPSS and ADF tests conclude that the series is not stationary. In other words, our auto-correlation is a function of both lag \(h\) and time \(t\).

Problem 1c.)

Let us look at our current differenced lag 12 and \(d=1\) series again.

ggtsdisplay(y.rdiff.sdiff)

AR Terms - non-seasonal and seasonal

Looking at the PACF plot we can suggest a value of \(p=2\) due to the 2 spikes above the confidence interval region after the initial spike at lag 0. Furthermore, since there is 1 significant spike at lag 12 we can suggest that \(P = 1\).

MA Terms - non-seasonal and seasonal

Looking at the ACF plot we can suggest a value of \(q=2\) due to the 2 spikes above the confidence interval region after the initial spike at lag 0. Furthermore, since there is 1 significant spike at lag 12 we can suggest that \(Q=1\)

d and D - non-seasonal and seasonal differencing

Our \(d=1\) because of the regular differencing we already took in part (b.). Furthermore, since we also used a lagged difference of 12, this suggests that \(D=1\)

Therefore, so far we have suggested \((p,d,q)x(P,D,Q) = (2,1,2)\)x\((1, 1, 1)[12]\)

Problem 1d.)

We will fit three models to the original time series.

Fit1: SARIMA(2,1,2)x(1,1,1)[12]

Fit2: SARIMA(1,2,1)x(1,1,1)[12]

Fit3: R auto.arima() function will search for the hyperparameters.

Fit4: R auto.arime() function with seasonal=FALSE

Fit1: SARIMA(2,1,2)x(1,1,1)[12]

fit1 <- Arima(myts, order=c(2,1,2), seasonal=c(1,1,1),lambda=NULL, include.constant=TRUE)

resids.fit1 <- residuals(fit1)
plot(resids.fit1)
abline(h=0, lty=2)

qqnorm(resids.fit1)
qqline(resids.fit1)

Acf(resids.fit1, main='ACF of Resids F1')

This fit1 model seems promising. The variance of the residuals is not perfect, but that is really only due to the one large deviance around 1997. Otherwise the residuals appear to have constant-variance that does not depend on the time. Furthermore, the residuals are approxiamtely normally distributed per the qqplot. Lastly, the residuals appear to be uncorrelated to the ACF plot converging to zero rather quickly.

Fit2: SARIMA(1,2,1)x(1,1,1)[12]

fit2 <- Arima(myts, order=c(1,2,1), seasonal=c(1,1,1),lambda=NULL, include.constant=TRUE)

resids.fit2 <- residuals(fit2)
plot(resids.fit2)
abline(h=0, lty=2)

qqnorm(resids.fit2)
qqline(resids.fit2)

Acf(resids.fit2, main='ACF of Resids F2')

This fit2 model seems less promising than fit1 model for the following reasons:

The qqplot suggests these residuals are not as normal as the previous fit1. While the variance is nearly non-constant (same spike around 1997 as before), normality of the residuals is an important assumption for model validity. Lastly, the residuals appear to be correlated due to the spike in the ACF at lag 6.

Fit3: R auto.arima() function

fit3 <- auto.arima(myts)
fit3
## Series: myts 
## ARIMA(1,1,0)(2,0,0)[12] 
## 
## Coefficients:
##          ar1    sar1    sar2
##       0.1878  0.3258  0.3984
## s.e.  0.0610  0.0580  0.0610
## 
## sigma^2 estimated as 5.228:  log likelihood=-593.36
## AIC=1194.72   AICc=1194.88   BIC=1209.01

R’s auto.arima() function will search for the hyperparameters. Ultimately the search decided on the following SARIMA model:

SARIMA(1,1,0)x(2,0,0)[12]

We now look at the residuals.

resids.fit3 <- residuals(fit3)
plot(resids.fit3)
abline(h=0, lty=2)

qqnorm(resids.fit3)
qqline(resids.fit3)

Acf(resids.fit3, main='ACF of Resids F3')

The residuals of this model fit3 appear valid for the following reasons.

There appears to be constant variance. Of the 3 models this model fit3 has the most stable variance (no spikes deviating from the norm).

The qqplot suggests the residuals are approximately normally distributed.

Finally, the ACF plot suggests the residuals are not correlated since the ACF plot quickly converges to 0.

Fit4: auto.arima() without seasonal

fit4 <- auto.arima(myts, seasonal=FALSE)
fit4
## Series: myts 
## ARIMA(1,1,3) 
## 
## Coefficients:
##          ar1      ma1     ma2     ma3
##       0.7056  -0.7850  0.0641  0.3008
## s.e.  0.0789   0.0832  0.0737  0.0678
## 
## sigma^2 estimated as 7.45:  log likelihood=-635.59
## AIC=1281.18   AICc=1281.42   BIC=1299.04

Ultimately the search decided on the following ARIMA model:

ARIMA(1,1,3)

We now look at the residuals.

resids.fit4 <- residuals(fit4)
plot(resids.fit4)
abline(h=0, lty=2)

qqnorm(resids.fit4)
qqline(resids.fit4)

Acf(resids.fit4, main='ACF of Resids F4')

The residuals of model Fit4 appear to meet normality expectations, but again not as well as Fit1 or Fit3. However, our residuals are correlated at large lag values, suggesting that our residuals are not uncorrelated. Thus, Fit4 is not one of our top models.

Comparative Model Criteria (AIC)

Recall our four models.

Fit1: SARIMA(2,1,2)x(1,1,1)[12]

Fit2: SARIMA(1,2,1)x(1,1,1)[12]

Fit3: R auto.arima() function will search for the hyperparameters.

Fit4: R auto.arima() without seasonality

Since Fit4 had a clear violation of correlated residuals, we will not include it in our top 3.

Fit1 and Fit3 appeared to have residuals that met the model assumptions. In particular, normally distributed with mean 0 and constant variance independent of time and uncorrelated. Fit2, while not as strong in meeting these model assumptions, was not sharply off. Therefore we will include all three in comparing their AIC.

AIC.fit1 = AIC(fit1)
AIC.fit2 = AIC(fit2)
AIC.fit3 = AIC(fit3)

AIC.results = c(AIC.fit1, AIC.fit2, AIC.fit3)
models = c('fit1', 'fit2', 'fit3')

results.df = data.frame(models, AIC.results)
results.df
##   models AIC.results
## 1   fit1    1100.571
## 2   fit2    1118.049
## 3   fit3    1194.725

As we can see, while fit2 has the lowest AIC, fit1 has the lowest and successfully meets the model assumptions via residual diagnostics. Therefore, we can conclude that of our 4 models, Fit1 is our top recommended model.

Problem 1e): Forecasting

We will visualize the forecast of each of our 3 models.

fcast.f1 = forecast(fit1, h=24)
plot(fcast.f1, main='Fit1 Forecast')
lines(all.ts)

fcast.f2 = forecast(fit2, h=24)
plot(fcast.f2, main='Fit2 Forecast')
lines(all.ts)

fcast.f3 = forecast(fit3, h=24)
plot(fcast.f3, main='Fit3 Forecast')
lines(all.ts)

Visually we can conclude that Fit2 did the poorest in terms of forecasting. This can be seen from the wide confidence intervals for the prediction. In other words, our predicted values had a wide variance.

Fit1 and Fit3 appeared to do a better job at forecasting in the last two years. Both of their confidence intervals were tighter than Fit2.

We will use the SSR of the predicted - actual to compare Fit1 and Fit3.

SSR.f1 = sum((data$V1[265:length(data$V1)] - as.numeric(fcast.f1$fitted))^2)

SSR.f3 = sum((data$V1[265:length(data$V1)] - as.numeric(fcast.f3$fitted))^2)

SSRs = c(SSR.f1, SSR.f3)

models = c('Fit1', 'Fit3')

results.df = data.frame(models, SSRs)
results.df
##   models     SSRs
## 1   Fit1 610623.0
## 2   Fit3 610221.4

Fit3 has a lower SSR than that of Fit1 for the time series values between 2013 and 2015.

Thus, Fit3 is our top recommended model from a forecasting perspective. This is not the same model (Fit1) that we recommended in the previous section based on AIC.

Overall I would recommend Fit3. Fit3 was fit using R’s auto.arima() function and did an optimization grid search of hyperparameters for the SARIMA model. The AIC score of Fit1, while slightly lower than that of Fit3, was not significantly lower (less than 3% difference in AIC scores). Furthermore, Fit3 forecasted slightly better.

In conclusion, while Fit1 and Fit3 met model expectations (normal residuals, uncorrelated residuals) and both performed slightly the same in terms of AIC and SSR forecasting, we recommended Fit3 since our ad-hoc hyperparemeters chosen by examining the ACF and PACF is not as precise as that of a optimization grid search that R does in auto.arima().

For completeness, here is the R console output for the fitted coefficients of Fit3.

fit3
## Series: myts 
## ARIMA(1,1,0)(2,0,0)[12] 
## 
## Coefficients:
##          ar1    sar1    sar2
##       0.1878  0.3258  0.3984
## s.e.  0.0610  0.0580  0.0610
## 
## sigma^2 estimated as 5.228:  log likelihood=-593.36
## AIC=1194.72   AICc=1194.88   BIC=1209.01

Problem 1f.)

I watched the 6 videos by Tech Know How. I did find the videos useful to a certain extent, but honestly the assignment as stated did not lend itself to be analyzed in the same method as he did online. For example, he did not try to analyze the ACF or PACF plots to deduce the hyperparameters of the SARIMA model. Furtermore, he instead used built in functions like stl() to decompose the time series into its trend and seasonal components. Overall, I did not think he addressed how to start Problem 1a, 1b, and 1c very well, which ultimately set up the rest of the problem set.

However, I do think he described how to do forecasting well. Therefore, Problem 1d and 1f were fairly straightforward in following his logic and R syntax.

Overall I think he did a fine job of balancing explanation of conceptual topics and explanation of the R code. I do not think he did a good job at explaining his analysis of the ACF and PACF plots, seasonality, de-seasoned series, etc. But, I’m starting to think that no one can and that for deciding on these parameteric models we just have to start somewhere and then let some maximization occur. The problem with that is that maybe we have decided on the wrong family of parametric models.

I would have spent more time explaining what to look for in the ACF or PACF plots besides saying ‘This looks good here’ or ‘This is problematic’. Far too often the explanation boiled down to a ‘this’ or ‘it’ statement. For a learning student such as myself, such broad and unspecific statements do not help me learn as I am left to only guess as to what ‘this’ or ‘it’ refers to.