Math 531T Exam 3

Libraries Used

library(ggplot2)
library(tseries)
library(forecast)

Data Introduction

The dataset Demand.txt includes 24 years (Jan. 1992 - Dec. 2015) of total monthly number of customers having business with a financial firm. Interest lies in performing a complete exploratory and model fitting analysis.

We will not include the last two years (Dec.2013-Dec.2015) in the exploratory process or evaluation of the models. These last two years will be used as a validation set against our models.

Problem 1a.) Check for non-constant variance

We first will visualize the time series.

#read in data
data <- read.csv('Demand.txt', header=FALSE)
dates = seq(as.Date("1992/1/1"), as.Date("2015/12/1"), by='month')
data$DATES = dates
data$Months = months(data$DATES)

ggplot(data[1:264,], aes(DATES, V1)) + geom_line() + scale_x_date('Time') + 
  ylab('Monthly # of Customers') + xlab('') + ggtitle('Original TS (Except Last Two Years)')

Figure 1

myts <- ts(data$V1[1:264], start=c(1992, 1),frequency=12)
all.ts <- ts(data$V1, start=c(1992, 1),frequency=12)

We will comment on any general or seasonal trends in part (b.).

In this part (a.) we will discuss if the variance is constant or not.

For this time series of monthly # of customers for a financial firm we see in Figure 1 that the variance is almost constant. In other words, the fluctuations between minimum and maximum values are not changing over time. Thus, it is unlikely we will need to apply any variance-stabilization techniques. Just for completeness we will check if R recommmends a variance-stabilization technique by the BoxCox.lambda() function. If this function returns a value close to 1 then that suggests no transformation is needed to stabilize the variance.

BoxCox.lambda(myts)

## [1] 0.8943227

Therefore, it is reasonable to believe that no variance-stabilization technique is needed.

We will now discuss the stationarity of the original series by analyzing the sample ACF plot and utilizing some statistical tests.

par(mfrow=c(1,2))
Acf(myts, main='ACF of Orig.TS')
Pacf(myts, main='PACF of Orig.TS')

Figure 2

The sample ACF plot is converging slowly to the standard normal bounds. This indicates the original time series is not stationary.

There are two statistical tests we can use to check if the series if stationary: ADF and KPSS tests.

adf.test(myts, alternative='stationary') #null hypothesis not rejected

## 
##  Augmented Dickey-Fuller Test
## 
## data:  myts
## Dickey-Fuller = -1.6895, Lag order = 6, p-value = 0.7067
## alternative hypothesis: stationary

kpss.test(myts, null='Level') #null hypothesis rejected.

## Warning in kpss.test(myts, null = "Level"): p-value smaller than printed p-value

## 
##  KPSS Test for Level Stationarity
## 
## data:  myts
## KPSS Level = 2.8897, Truncation lag parameter = 5, p-value = 0.01

As we can see from the console output, both the KPSS and ADF tests conclude that the series is not stationary. In other words, our auto-correlation is a function of both lag \(h\) and time \(t\).

Problem 1b.) General and Seasonal Trends

ggtsdisplay(myts)

Figure 3

Above in Figure 3 we can see that there is a general increasing trend in the time series plot. This means the average number of monthly customers is increasing over time. One could say there is a polynomial trend due to the bouncy nature of the time series (multiple increasing and decreasing intervals). This suggests at least \(d = 1\) in a differencing process to de-trend the series.

At first glance it is difficult to say if there is a seasonal trend. Neither the time series plot (top main) nor the ACF suggest seasonality (no wave like graph in the ACF). The PACF does have spikes at 12, 24, etc. That is the only clue that a seasonal trend may exits. Nevertheless, we will attempt to decompose the series by lagged differences in hopes of de-seasoning the series.

We will start by trying to de-season the series with a lagged difference of 12 (since we have monthly data).

y.sdiff <- diff(myts, lag=12)
ggtsdisplay(y.sdiff)

In the lagged 12 difference ACF plot above of we still see a trend (in other words the ACF does not quickly drop to 0). This suggests, as we already discussed, that we will need to de-trend the general trend by using a regular differencing of \(d\) equal 1 (linear), 2 (quadratic), or possibly 3 (polynomial trend).

We start with \(d=1\).

y.rdiff <- diff(myts)
ggtsdisplay(y.rdiff)

We see that when we performed the regular differencing we have removed the general trend (see the time series above and the ACF plot no longer has a slowly decaying graph). However, we see in the ACF plot a periodic pattern at 12, 24, 36 (i.e. every year).

This suggests we will do both. In particular we will do a lagged seasonal difference of 12 and a regular difference of \(d=1\).

y.rdiff.sdiff <- diff(diff(myts, lag=12), differences=1)
ggtsdisplay(y.rdiff.sdiff)

Our ACF plot now looks stationary.

Let us formally test if our resulting differenced series is now stationary

adf.test(y.rdiff.sdiff, alternative='stationary')

## Warning in adf.test(y.rdiff.sdiff, alternative = "stationary"): p-value smaller
## than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  y.rdiff.sdiff
## Dickey-Fuller = -6.2987, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

kpss.test(y.rdiff.sdiff, null='Level')

## Warning in kpss.test(y.rdiff.sdiff, null = "Level"): p-value greater than
## printed p-value

## 
##  KPSS Test for Level Stationarity
## 
## data:  y.rdiff.sdiff
## KPSS Level = 0.13352, Truncation lag parameter = 5, p-value = 0.1

Both the KPSS and ADF tests conclude that our new series is stationary.

Problem 1c.)

Let us look at our current differenced lag 12 and \(d=1\) series again.

ggtsdisplay(y.rdiff.sdiff)

AR Terms - non-seasonal and seasonal

Looking at the PACF plot we can suggest a value of \(p=2\) due to the 2 spikes above the confidence interval region after the initial spike at lag 0. Furthermore, since there is 1 significant spike at lag 12 we can suggest that \(P = 1\).

MA Terms - non-seasonal and seasonal

Looking at the ACF plot we can suggest a value of \(q=2\) due to the 2 spikes above the confidence interval region after the initial spike at lag 0. Furthermore, since there is 1 significant spike at lag 12 we can suggest that \(Q=1\)

d and D - non-seasonal and seasonal differencing

Our \(d=1\) because of the regular differencing we already took in part (b.). Furthermore, since we also used a lagged difference of 12, this suggests that \(D=1\)

Therefore, so far we have suggested \((p,d,q)x(P,D,Q) = (2,1,2)\)x\((1, 1, 1)[12]\)

Problem 1d.)

We will fit three models to the original time series.

Fit1: SARIMA(2,1,2)x(1,1,1)[12]

Fit2: SARIMA(1,2,1)x(1,1,1)[12]

Fit3: R auto.arima() function will search for the hyperparameters.

Fit4: R auto.arime() function with seasonal=FALSE

Fit1: SARIMA(2,1,2)x(1,1,1)[12]

fit1 <- Arima(myts, order=c(2,1,2), seasonal=c(1,1,1),lambda=NULL, include.constant=TRUE)

resids.fit1 <- residuals(fit1)
plot(resids.fit1)
abline(h=0, lty=2)

qqnorm(resids.fit1)
qqline(resids.fit1)

Acf(resids.fit1, main='ACF of Resids F1')

This fit1 model seems promising. The variance of the residuals is not perfect, but that is really only due to the one large deviance around 1997. Otherwise the residuals appear to have constant-variance that does not depend on the time. Furthermore, the residuals are approxiamtely normally distributed per the qqplot. Lastly, the residuals appear to be uncorrelated to the ACF plot converging to zero rather quickly.

Fit2: SARIMA(1,2,1)x(1,1,1)[12]

fit2 <- Arima(myts, order=c(1,2,1), seasonal=c(1,1,1),lambda=NULL, include.constant=TRUE)

resids.fit2 <- residuals(fit2)
plot(resids.fit2)
abline(h=0, lty=2)

qqnorm(resids.fit2)
qqline(resids.fit2)

Acf(resids.fit2, main='ACF of Resids F2')

This fit2 model seems less promising than fit1 model for the following reasons:

The qqplot suggests these residuals are not as normal as the previous fit1. While the variance is nearly non-constant (same spike around 1997 as before), normality of the residuals is an important assumption for model validity. Lastly, the residuals appear to be correlated due to the spike in the ACF at lag 6.

Fit3: R auto.arima() function

fit3 <- auto.arima(myts)
fit3

## Series: myts 
## ARIMA(1,1,0)(2,0,0)[12] 
## 
## Coefficients:
##          ar1    sar1    sar2
##       0.1878  0.3258  0.3984
## s.e.  0.0610  0.0580  0.0610
## 
## sigma^2 estimated as 5.228:  log likelihood=-593.36
## AIC=1194.72   AICc=1194.88   BIC=1209.01

R’s auto.arima() function will search for the hyperparameters. Ultimately the search decided on the following SARIMA model:

SARIMA(1,1,0)x(2,0,0)[12]

We now look at the residuals.

resids.fit3 <- residuals(fit3)
plot(resids.fit3)
abline(h=0, lty=2)

qqnorm(resids.fit3)
qqline(resids.fit3)

Acf(resids.fit3, main='ACF of Resids F3')

The residuals of this model fit3 appear valid for the following reasons.

There appears to be constant variance. Of the 3 models this model fit3 has the most stable variance (no spikes deviating from the norm).

The qqplot suggests the residuals are approximately normally distributed.

Finally, the ACF plot suggests the residuals are not correlated since the ACF plot quickly converges to 0.

Fit4: auto.arima() without seasonal

fit4 <- auto.arima(myts, seasonal=FALSE)
fit4

## Series: myts 
## ARIMA(1,1,3) 
## 
## Coefficients:
##          ar1      ma1     ma2     ma3
##       0.7056  -0.7850  0.0641  0.3008
## s.e.  0.0789   0.0832  0.0737  0.0678
## 
## sigma^2 estimated as 7.45:  log likelihood=-635.59
## AIC=1281.18   AICc=1281.42   BIC=1299.04

Ultimately the search decided on the following ARIMA model:

ARIMA(1,1,3)

We now look at the residuals.

resids.fit4 <- residuals(fit4)
plot(resids.fit4)
abline(h=0, lty=2)

qqnorm(resids.fit4)
qqline(resids.fit4)

Acf(resids.fit4, main='ACF of Resids F4')

The residuals of model Fit4 appear to meet normality expectations, but again not as well as Fit1 or Fit3. However, our residuals are correlated at large lag values, suggesting that our residuals are not uncorrelated. Thus, Fit4 is not one of our top models.

Comparative Model Criteria (AIC)

Recall our four models.

Fit1: SARIMA(2,1,2)x(1,1,1)[12]

Fit2: SARIMA(1,2,1)x(1,1,1)[12]

Fit3: R auto.arima() function will search for the hyperparameters.

Fit4: R auto.arima() without seasonality

Since Fit4 had a clear violation of correlated residuals, we will not include it in our top 3.

Fit1 and Fit3 appeared to have residuals that met the model assumptions. In particular, normally distributed with mean 0 and constant variance independent of time and uncorrelated. Fit2, while not as strong in meeting these model assumptions, was not sharply off. Therefore we will include all three in comparing their AIC.

AIC.fit1 = AIC(fit1)
AIC.fit2 = AIC(fit2)
AIC.fit3 = AIC(fit3)

AIC.results = c(AIC.fit1, AIC.fit2, AIC.fit3)
models = c('fit1', 'fit2', 'fit3')

results.df = data.frame(models, AIC.results)
results.df

##   models AIC.results
## 1   fit1    1100.571
## 2   fit2    1118.049
## 3   fit3    1194.725

As we can see, while fit2 has the lowest AIC, fit1 has the lowest and successfully meets the model assumptions via residual diagnostics. Therefore, we can conclude that of our 4 models, Fit1 is our top recommended model.

Problem 1e): Forecasting

We will visualize the forecast of each of our 3 models.

fcast.f1 = forecast(fit1, h=24)
plot(fcast.f1, main='Fit1 Forecast')
lines(all.ts)

fcast.f2 = forecast(fit2, h=24)
plot(fcast.f2, main='Fit2 Forecast')
lines(all.ts)

fcast.f3 = forecast(fit3, h=24)
plot(fcast.f3, main='Fit3 Forecast')
lines(all.ts)

Visually we can conclude that Fit2 did the poorest in terms of forecasting. This can be seen from the wide confidence intervals for the prediction. In other words, our predicted values had a wide variance.

Fit1 and Fit3 appeared to do a better job at forecasting in the last two years. Both of their confidence intervals were tighter than Fit2.

We will use the SSR of the predicted - actual to compare Fit1 and Fit3.

SSR.f1 = sum((data$V1[265:length(data$V1)] - as.numeric(fcast.f1$fitted))^2)

SSR.f3 = sum((data$V1[265:length(data$V1)] - as.numeric(fcast.f3$fitted))^2)

SSRs = c(SSR.f1, SSR.f3)

models = c('Fit1', 'Fit3')

results.df = data.frame(models, SSRs)
results.df

##   models     SSRs
## 1   Fit1 610623.0
## 2   Fit3 610221.4

Fit3 has a lower SSR than that of Fit1 for the time series values between 2013 and 2015.

Thus, Fit3 is our top recommended model from a forecasting perspective. This is not the same model (Fit1) that we recommended in the previous section based on AIC.

Overall I would recommend Fit3. Fit3 was fit using R’s auto.arima() function and did an optimization grid search of hyperparameters for the SARIMA model. The AIC score of Fit1, while slightly lower than that of Fit3, was not significantly lower (less than 3% difference in AIC scores). Furthermore, Fit3 forecasted slightly better.

In conclusion, while Fit1 and Fit3 met model expectations (normal residuals, uncorrelated residuals) and both performed slightly the same in terms of AIC and SSR forecasting, we recommended Fit3 since our ad-hoc hyperparemeters chosen by examining the ACF and PACF is not as precise as that of a optimization grid search that R does in auto.arima().

For completeness, here is the R console output for the fitted coefficients of Fit3.

fit3

## Series: myts 
## ARIMA(1,1,0)(2,0,0)[12] 
## 
## Coefficients:
##          ar1    sar1    sar2
##       0.1878  0.3258  0.3984
## s.e.  0.0610  0.0580  0.0610
## 
## sigma^2 estimated as 5.228:  log likelihood=-593.36
## AIC=1194.72   AICc=1194.88   BIC=1209.01

Problem 1f.)

I watched the 6 videos by Tech Know How. I did find the videos useful to a certain extent, but honestly the assignment as stated did not lend itself to be analyzed in the same method as he did online. For example, he did not try to analyze the ACF or PACF plots to deduce the hyperparameters of the SARIMA model. Furtermore, he instead used built in functions like stl() to decompose the time series into its trend and seasonal components. Overall, I did not think he addressed how to start Problem 1a, 1b, and 1c very well, which ultimately set up the rest of the problem set.

However, I do think he described how to do forecasting well. Therefore, Problem 1d and 1f were fairly straightforward in following his logic and R syntax.

Overall I think he did a fine job of balancing explanation of conceptual topics and explanation of the R code. I do not think he did a good job at explaining his analysis of the ACF and PACF plots, seasonality, de-seasoned series, etc. But, I’m starting to think that no one can and that for deciding on these parameteric models we just have to start somewhere and then let some maximization occur. The problem with that is that maybe we have decided on the wrong family of parametric models.

I would have spent more time explaining what to look for in the ACF or PACF plots besides saying ‘This looks good here’ or ‘This is problematic’. Far too often the explanation boiled down to a ‘this’ or ‘it’ statement. For a learning student such as myself, such broad and unspecific statements do not help me learn as I am left to only guess as to what ‘this’ or ‘it’ refers to.

Math 531T Exam 3

Hunter Alzate

7/27/2021

Libraries Used

Data Introduction

Problem 1a.) Check for non-constant variance

Problem 1b.) General and Seasonal Trends

Problem 1c.)

AR Terms - non-seasonal and seasonal

MA Terms - non-seasonal and seasonal

d and D - non-seasonal and seasonal differencing

Problem 1d.)

Fit1: SARIMA(2,1,2)x(1,1,1)[12]

Fit2: SARIMA(1,2,1)x(1,1,1)[12]

Fit3: R auto.arima() function

Fit4: auto.arima() without seasonal

Comparative Model Criteria (AIC)

Problem 1e): Forecasting

Problem 1f.)