Load packages

library(fpp2)
library(vars)
library(knitr)

Loading dataset

unilever_sales = read.csv("C:/ECON 4210/unilever_sales.csv")
df = unilever_sales
df = ts(df, start=2003, frequency=4)
y = df[,"SALES"]

Understanding the Data

y %>%
  stl(t.window=13, s.window="periodic", robust=TRUE) %>%
  autoplot()

tsdisplay(y)

Prior to evaluating the different forecasting models, it is necessary to examine the properties of the data. The STL decomposition (Seasonal and Trend decomposition using Loess) gives a clearer view of the trend, seasonal, and remainder components. The data appear to be non-stationary, as the values tend to increase over the years apart from a few anomalies, and the series shows strong seasonality. The remainder component stays relatively stable, with a few outliers such as the one just before 2005.

The ACF declines slowly and shows substantial autocorrelation, meaning Unilever's sales are influenced by their own past values. The partial ACF shows a few significant spikes within the first 5 lags that need to be examined further.
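Before splitting the data, a season plot can make the quarterly pattern explicit. This is only a supplementary check (a sketch using ggseasonplot() from the forecast/fpp2 packages already loaded), not part of the model comparison.

ggseasonplot(y, year.labels=TRUE) +
  ylab("SALES") +
  ggtitle("Seasonal plot of Unilever quarterly sales")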

Training & Testing the data

We will use data up until the end of 2015 for training and data from 2016 Q1 to 2019 Q2 for testing.

train <- window(y,start=c(2003, 1),end=c(2015, 4))
test <- window(y,start=2016)
both <- window(y,start=c(2003, 1))
h=length(test)

Forecasting Benchmarks

fit1 <- naive(train, h=h)
p1 = autoplot(fit1) + ylab("SALES")

fit2 <- snaive(train, h=h)
p2 = autoplot(fit2) + ylab("SALES")

TSLM Method

tps = tslm(train ~ trend + season)
fit3 = forecast(tps, h=h)
p3 = autoplot(fit3) + ylab("SALES")

STL Method

STL <- stl(train, t.window=15, s.window="periodic", robust=TRUE)
fit4 <- forecast(STL, method="rwdrift",h=h)
p4 = autoplot(fit4) + ylab("SALES")

ETS Method

ETS <- ets(train, model="ZZZ") 
fit5 <- forecast(ETS, h=h)
p5 = autoplot(fit5) + ylab("SALES")

Auto ARIMA Method

ndiffs(y)
## [1] 1

One difference is required to make the series stationary; auto.arima() will apply this automatically.

Auto Arima

fit6 = forecast(auto.arima(train), h=h)
p6 = autoplot(fit6) + ylab("SALES")
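As a quick sanity check (a sketch, not required for the comparison), we can inspect which ARIMA specification auto.arima() selected and whether its residuals resemble white noise:

summary(fit6$model)   # selected ARIMA order and coefficients
checkresiduals(fit6)  # residual plots plus a Ljung-Box test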

Neural Network model (NNETAR)

fit7 = forecast(nnetar(train, lambda=0),PI=TRUE, h=h)
p7 = autoplot(fit7) + ylab("SALES")

Naive and SNaive Method Forecasts

gridExtra::grid.arrange(p1, p2, nrow=1)

TSLM and STL Method Forecasts

gridExtra::grid.arrange(p3, p4, nrow=1)

ETS, Auto Arima, and NNETAR Method Forecasts

gridExtra::grid.arrange(p5, p6, nrow=1)

gridExtra::grid.arrange(p7, nrow=1)

It appears that most of the models capture the seasonality and trend in the data, except for the Naive method and NNETAR, whose forecasts do not reflect the trend and seasonal components.

Accuracy Measures

a1 = accuracy(fit1, test)
a2 = accuracy(fit2, test)
a3 = accuracy(fit3, test)
a4 = accuracy(fit4, test)
a5 = accuracy(fit5, test)
a6 = accuracy(fit6, test)
a7 = accuracy(fit7, test)

a.table<-rbind(a1, a2, a3, a4, a5, a6, a7)
row.names(a.table) = c('Naive training','Naive test','SNaive training','SNaive test','TSLM training','TSLM test','STL training','STL test','ETS training','ETS test','Auto Arima training','Auto Arima test','NNETAR training','NNETAR test')

a.table<-as.data.frame(a.table)
a.table<-a.table[order(a.table$MASE),]
a.table
##                                ME     RMSE      MAE         MPE     MAPE
## ETS test            -1.440756e+02 443.0392 338.4432 -1.18901222 2.608628
## SNaive test         -2.299286e+02 429.1045 358.5000 -1.81015959 2.761429
## Auto Arima training  1.740430e+02 597.4257 375.9920  1.61317194 3.590891
## STL training        -2.674844e-13 608.8270 383.3204 -0.24715258 3.733277
## ETS training         6.992507e+01 588.3705 397.4980  0.38164598 3.799414
## NNETAR training      2.685263e+01 653.1317 454.2504 -0.19780934 4.199233
## TSLM test           -6.902002e+01 568.3386 474.8073 -0.64001368 3.651071
## STL test            -3.899278e+02 626.5788 509.5432 -3.09725408 3.961442
## NNETAR test          2.190661e+02 638.3111 521.3303  1.46310631 3.903082
## Naive test           2.292143e+02 657.3726 545.6429  1.53350639 4.085683
## Auto Arima test     -6.421706e+02 844.1121 689.1401 -5.01724463 5.360577
## TSLM training       -1.050036e-13 899.5068 699.2912 -0.67256281 6.370121
## SNaive training      1.218750e+02 984.3011 709.4167  0.53427371 6.527860
## Naive training       3.611765e+01 923.1701 716.1961 -0.09343404 6.683591
##                          MASE        ACF1 Theil's U
## ETS test            0.4770725  0.45289735 0.5374758
## SNaive test         0.5053448  0.68242367 0.5304308
## Auto Arima training 0.5300016 -0.14158786        NA
## STL training        0.5403319 -0.30660304        NA
## ETS training        0.5603167 -0.03135721        NA
## NNETAR training     0.6403154  0.16840605        NA
## TSLM test           0.6692926  0.67545994 0.6978107
## STL test            0.7182566  0.58732776 0.7725885
## NNETAR test         0.7348718  0.12155403 0.7998820
## Naive test          0.7691430  0.07536873 0.8170365
## Auto Arima test     0.9714180  0.70981836 1.0514358
## TSLM training       0.9857270  0.72480530        NA
## SNaive training     1.0000000  0.60555810        NA
## Naive training      1.0095563 -0.17270665        NA

ETS performed the best, with the lowest MAPE and MASE on the test set. Several other models also fit the data reasonably well, such as Seasonal Naive, TSLM, and STL. NNETAR, Naive, and Auto ARIMA performed the worst, with the largest test MAPE and MASE. The neural network (NNETAR) was likely ineffective because the dataset is small and the series does not contain much non-linearity. I expected Auto ARIMA to perform better, since it searches over many candidate specifications to minimize the AICc.
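As a natural follow-up (a sketch only), the best-performing model could be refit on the full sample (the both series defined earlier) and used to forecast beyond 2019 Q2; the object name and the two-year horizon here are just illustrative choices.

fit.final <- ets(both)                         # refit ETS on all available data
autoplot(forecast(fit.final, h=8)) + ylab("SALES")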

Question 2

Loading dataset & creating time series

CNR_data <- read.csv("C:/ECON 4210/as4_data.csv")

df = CNR_data 
df = ts(df, start=2000, frequency=4)
y = df[,"CNR"]

Properties of dataset

tsdisplay(y)

Prior to analysing the forecasts, it is necessary to examine the properties of the data. The first plot shows that the series has increased gradually, with a few exceptions such as the 2008-2009 recession. The ACF decays slowly and extends well beyond the blue significance bounds, meaning the data are strongly correlated with their own past values. The partial autocorrelation plot shows a large spike at the first lag, which we should keep in mind.

Vector Autoregression (VAR) model

Select the variables

cor(df[,2:4])
##           CNR        epu         cs
## CNR 1.0000000  0.1604483  0.1167658
## epu 0.1604483  1.0000000 -0.5998639
## cs  0.1167658 -0.5998639  1.0000000

The table shows the correlations between the three variables: Canadian National Railway revenue (CNR), US Consumer Sentiment (cs), and US Economic Policy Uncertainty (epu). There is a moderate negative relationship between cs and epu (about -0.60), while the other two relationships are weak and positive. This will be kept in mind when moving forward with the analysis.
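A visual counterpart to the correlation table (a sketch): a scatterplot matrix of the three series makes the weak positive and moderate negative relationships easier to see.

pairs(df[, c("CNR", "epu", "cs")], main="Scatterplot matrix of CNR, epu and cs")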

Checking data and making adjustments

vardata = log(df[, c(4, 3, 2)])   # reorder to cs, epu, CNR and take logs
plot(vardata, main = "VAR data", xlab = "")

As can be seen from the graph, the series are non-stationary, and differencing would be needed to make them stationary. The moderate negative relationship between cs and epu is also visible: these two variables tend to move in opposite directions. Beyond this, there are no other obvious patterns in the data.
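As a quick check (a sketch), ndiffs() from the forecast package can be applied to each column of vardata to confirm how many differences each series would need to become stationary:

sapply(colnames(vardata), function(v) ndiffs(vardata[, v]))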

Selecting order of models

VARselect(vardata, lag.max = 9, type = "both", season=4)
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      2      1      1      2 
## 
## $criteria
##                    1             2             3             4
## AIC(n) -1.573679e+01 -1.580052e+01 -1.561701e+01 -1.542691e+01
## HQ(n)  -1.542850e+01 -1.537662e+01 -1.507750e+01 -1.477178e+01
## SC(n)  -1.495971e+01 -1.473204e+01 -1.425712e+01 -1.377561e+01
## FPE(n)  1.468800e-07  1.385142e-07  1.679024e-07  2.059033e-07
##                    5             6             7             8
## AIC(n) -1.534359e+01 -1.528366e+01 -1.520204e+01 -1.517417e+01
## HQ(n)  -1.457286e+01 -1.439732e+01 -1.420008e+01 -1.405661e+01
## SC(n)  -1.340089e+01 -1.304956e+01 -1.267653e+01 -1.235726e+01
## FPE(n)  2.283760e-07  2.493967e-07  2.810094e-07  3.035711e-07
##                    9
## AIC(n) -1.504246e+01
## HQ(n)  -1.380928e+01
## SC(n)  -1.193414e+01
## FPE(n)  3.690039e-07

The output above shows the lag length selected by each criterion in the vars package. VAR(1) is selected by SC and HQ, while VAR(2) is selected by AIC and FPE. Since there is a discrepancy, we will fit VAR(1) first.

var1 <- VAR(vardata, p=1, type="both", season=4)
serial.test(var1, lags.pt=16, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object var1
## Chi-squared = 108.84, df = 135, p-value = 0.9523
var2 <- VAR(vardata, p=2, type="both", season=4)
serial.test(var2, lags.pt=16, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object var2
## Chi-squared = 93.623, df = 126, p-value = 0.9862

The residuals of both models pass the test, since the p-values are greater than 0.05.
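Since both orders pass the Portmanteau test, one way to compare them directly (a sketch, assuming the logLik method that vars provides for VAR objects) is through information criteria, where lower values are preferred:

AIC(var1)
AIC(var2)
BIC(var1)
BIC(var2)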

Summary

var.1 = VAR(vardata, p=1, type = "both", season =4)
summary(var.1)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: cs, epu, CNR 
## Deterministic variables: both 
## Sample size: 77 
## Log Likelihood: 295.034 
## Roots of the characteristic polynomial:
## 0.905 0.8032 0.5827
## Call:
## VAR(y = vardata, p = 1, type = "both", season = 4L)
## 
## 
## Estimation results for equation cs: 
## =================================== 
## cs = cs.l1 + epu.l1 + CNR.l1 + const + trend + sd1 + sd2 + sd3 
## 
##          Estimate Std. Error t value Pr(>|t|)    
## cs.l1   0.9009361  0.0595690  15.124   <2e-16 ***
## epu.l1 -0.0072481  0.0363265  -0.200    0.842    
## CNR.l1 -0.0149650  0.1261235  -0.119    0.906    
## const   0.5636787  1.0116537   0.557    0.579    
## trend   0.0005837  0.0017650   0.331    0.742    
## sd1     0.0169272  0.0200933   0.842    0.402    
## sd2    -0.0008554  0.0201108  -0.043    0.966    
## sd3    -0.0197770  0.0200004  -0.989    0.326    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.06147 on 69 degrees of freedom
## Multiple R-Squared: 0.8399,  Adjusted R-squared: 0.8237 
## F-statistic: 51.72 on 7 and 69 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation epu: 
## ==================================== 
## epu = cs.l1 + epu.l1 + CNR.l1 + const + trend + sd1 + sd2 + sd3 
## 
##         Estimate Std. Error t value Pr(>|t|)    
## cs.l1  -0.136720   0.150421  -0.909   0.3666    
## epu.l1  0.747630   0.091730   8.150 1.06e-11 ***
## CNR.l1 -0.186615   0.318481  -0.586   0.5598    
## const   3.102327   2.554577   1.214   0.2287    
## trend   0.003250   0.004457   0.729   0.4683    
## sd1    -0.044173   0.050739  -0.871   0.3870    
## sd2    -0.105439   0.050783  -2.076   0.0416 *  
## sd3     0.005206   0.050504   0.103   0.9182    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.1552 on 69 degrees of freedom
## Multiple R-Squared: 0.7136,  Adjusted R-squared: 0.6846 
## F-statistic: 24.56 on 7 and 69 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation CNR: 
## ==================================== 
## CNR = cs.l1 + epu.l1 + CNR.l1 + const + trend + sd1 + sd2 + sd3 
## 
##         Estimate Std. Error t value Pr(>|t|)    
## cs.l1   0.024442   0.039897   0.613 0.542133    
## epu.l1 -0.051533   0.024330  -2.118 0.037773 *  
## CNR.l1  0.642412   0.084472   7.605 1.05e-10 ***
## const   2.687518   0.677564   3.966 0.000176 ***
## trend   0.005126   0.001182   4.336 4.84e-05 ***
## sd1    -0.048924   0.013458  -3.635 0.000530 ***
## sd2     0.017281   0.013469   1.283 0.203783    
## sd3    -0.019313   0.013395  -1.442 0.153898    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.04117 on 69 degrees of freedom
## Multiple R-Squared: 0.9837,  Adjusted R-squared: 0.9821 
## F-statistic: 596.7 on 7 and 69 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##             cs        epu        CNR
## cs   0.0037781 -0.0036034 -0.0002039
## epu -0.0036034  0.0240907 -0.0001099
## CNR -0.0002039 -0.0001099  0.0016948
## 
## Correlation matrix of residuals:
##           cs      epu      CNR
## cs   1.00000 -0.37771 -0.08057
## epu -0.37771  1.00000 -0.01719
## CNR -0.08057 -0.01719  1.00000
roots(var.1)
## [1] 0.9050434 0.8032001 0.5827345

We can look at the R^2 values to judge how well the model fits each equation. The R^2 for the cs equation is about 0.84, which is good. The R^2 for epu is about 0.71, which is also reasonable. Finally, the R^2 for CNR is about 0.98, which indicates the model fits CNR very well.

All the roots are less than 1 in modulus, which implies the estimated VAR is stable.

Plot cs

plot(var.1, names = "cs")

Plot epu

plot(var.1, names = "epu")

Plot CNR

plot(var.1, names = "CNR")

The plots show a good fit, as the fitted values (dotted blue line) track most of the movements in the actual series.

Combined plots

acf(residuals(var.1), type="partial", lag.max=10)

Acf(residuals(var.1)[,"cs"], main="ACF of cs")

Acf(residuals(var.1)[,"epu"], main="ACF of epu")

Acf(residuals(var.1)[,"CNR"], main="ACF of CNR")

Detailed breakdown of Residuals

For the most part, the residual autocorrelations of all three variables stay within the blue significance bounds, indicating little remaining correlation. For cs, there is some correlation around the 10th lag, but it does not cross the bounds. For epu, there is some correlation at the 17th lag, but again it stays within the bounds. The same applies to CNR for the first few lags.

serial.test(var.1, lags.pt = 16, type = "PT.adjusted")
## 
##  Portmanteau Test (adjusted)
## 
## data:  Residuals of VAR object var.1
## Chi-squared = 123, df = 135, p-value = 0.7618

Since the p-value of 0.76 is greater than 0.05, there is not enough evidence of residual autocorrelation and the model is adequate.
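For completeness (a sketch), vars also provides multivariate normality and ARCH tests on the VAR residuals, which complement the Portmanteau test above:

normality.test(var.1)
arch.test(var.1, lags.multi=5)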

Granger Causality

causality(var.1, cause= c("cs", "epu" ))
## $Granger
## 
##  Granger causality H0: cs epu do not Granger-cause CNR
## 
## data:  VAR object var.1
## F-Test = 4.1334, df1 = 2, df2 = 207, p-value = 0.01737
## 
## 
## $Instant
## 
##  H0: No instantaneous causality between: cs epu and CNR
## 
## data:  VAR object var.1
## Chi-squared = 0.69719, df = 2, p-value = 0.7057

The first test produces a low p-value of 0.017, so we reject the null hypothesis that cs and epu do not Granger-cause CNR: consumer sentiment and policy uncertainty jointly help predict CNR revenue. The second test produces a high p-value of 0.70, so we cannot reject the null; there is no evidence of instantaneous causality between cs, epu, and CNR.
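For completeness (a sketch), the reverse direction can also be tested: whether CNR Granger-causes cs and epu.

causality(var.1, cause="CNR")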

Impulse Response Simulations

Impulse response (IR) analysis is an effective technique for performing what-if analysis (e.g. what happens to B if A increases by C percent?).

var1a.irf <- irf(var.1,  n.ahead = 16, boot = TRUE,  runs=500, seed=99, cumulative=FALSE)
par(mfrow=c(3,3))
plot(var1a.irf, plot.type = "single")

par(mfrow=c(2,2))
plot( irf(var.1, response = "CNR", n.ahead = 24, boot = TRUE,  runs=500) , plot.type = "single")

par(mfrow=c(1,1))

The first graph shows an interesting relationship between cs and CNR: a shock to cs produces a noticeable response in CNR, so cs has some influence.

The second graph shows that epu also influences CNR, producing a sizeable response. cs and epu appear to have opposite effects on CNR.

The third graph shows the response of CNR to a shock in CNR itself, which is less informative for our question.
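A related view (a sketch): cumulative impulse responses of CNR show the accumulated effect of a cs or epu shock over time rather than the period-by-period response; the 24-quarter horizon mirrors the second plot above.

plot(irf(var.1, response="CNR", n.ahead=24, boot=TRUE, runs=500, cumulative=TRUE),
     plot.type="single")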

Forecast error variance decomposition (FEVD)

fevd(var.1, n.ahead = 16)
## $cs
##              cs          epu          CNR
##  [1,] 1.0000000 0.0000000000 0.000000e+00
##  [2,] 0.9997976 0.0001478543 5.451247e-05
##  [3,] 0.9995115 0.0003648783 1.235719e-04
##  [4,] 0.9992259 0.0005892430 1.848536e-04
##  [5,] 0.9989700 0.0007967427 2.332436e-04
##  [6,] 0.9987509 0.0009795881 2.695285e-04
##  [7,] 0.9985670 0.0011369421 2.960663e-04
##  [8,] 0.9984140 0.0012707033 3.152501e-04
##  [9,] 0.9982873 0.0013836516 3.290567e-04
## [10,] 0.9981823 0.0014786610 3.389912e-04
## [11,] 0.9980955 0.0015583902 3.461558e-04
## [12,] 0.9980235 0.0016251846 3.513421e-04
## [13,] 0.9979638 0.0016810672 3.551137e-04
## [14,] 0.9979144 0.0017277639 3.578705e-04
## [15,] 0.9978734 0.0017667393 3.598963e-04
## [16,] 0.9978394 0.0017992324 3.613929e-04
## 
## $epu
##              cs       epu         CNR
##  [1,] 0.1426629 0.8573371 0.000000000
##  [2,] 0.1587004 0.8397784 0.001521264
##  [3,] 0.1742164 0.8222125 0.003571109
##  [4,] 0.1888546 0.8057224 0.005423030
##  [5,] 0.2023709 0.7907804 0.006848679
##  [6,] 0.2146225 0.7775287 0.007848743
##  [7,] 0.2255539 0.7659413 0.008504816
##  [8,] 0.2351778 0.7559116 0.008910616
##  [9,] 0.2435562 0.7472978 0.009146003
## [10,] 0.2507822 0.7399469 0.009270958
## [11,] 0.2569660 0.7337066 0.009327411
## [12,] 0.2622239 0.7284328 0.009343302
## [13,] 0.2666705 0.7239930 0.009336569
## [14,] 0.2704141 0.7202675 0.009318358
## [15,] 0.2735541 0.7171505 0.009295356
## [16,] 0.2761796 0.7145490 0.009271406
## 
## $CNR
##                cs         epu       CNR
##  [1,] 0.006491575 0.002645605 0.9908628
##  [2,] 0.006770846 0.032922864 0.9603063
##  [3,] 0.016289223 0.070770389 0.9129404
##  [4,] 0.030733027 0.104474159 0.8647928
##  [5,] 0.046790033 0.130146865 0.8230631
##  [6,] 0.062562670 0.148082929 0.7893544
##  [7,] 0.077141561 0.159937525 0.7629209
##  [8,] 0.090172375 0.167440261 0.7423874
##  [9,] 0.101585109 0.171988634 0.7264263
## [10,] 0.111447791 0.174600519 0.7139517
## [11,] 0.119890767 0.175978658 0.7041306
## [12,] 0.127067612 0.176592751 0.6963396
## [13,] 0.133134856 0.176749618 0.6901155
## [14,] 0.138241598 0.176645862 0.6851125
## [15,] 0.142524487 0.176405048 0.6810705
## [16,] 0.146105715 0.176103090 0.6777912

In the decomposition of CNR's forecast error variance, epu accounts for a larger share than cs at most horizons, which suggests its influence on CNR is slightly stronger.
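The decomposition is easier to read as a plot (a sketch): stacked bars showing the share of each variable's forecast error variance attributed to cs, epu, and CNR at each horizon.

plot(fevd(var.1, n.ahead=16))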

Forecast with VAR

var.fc = forecast(var.1, h=5)
autoplot(var.fc) + xlab("year")

The forecast of CNR captures the trend well, projecting a continued increase. The forecast of cs suggests consumer sentiment will remain stable. The forecast of epu, however, shows a dip heading into 2020 and beyond, which is expected given the volatile nature of economic policy uncertainty.
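To read off the numeric values behind these forecasts (a sketch; var.pred is just a placeholder name), predict() on the VAR object returns point forecasts with intervals, and fanchart() from vars plots them. Note the forecasts are for log(CNR), since vardata was logged.

var.pred <- predict(var.1, n.ahead=5, ci=0.95)
var.pred$fcst$CNR                 # point forecast, lower, upper, CI (in logs)
fanchart(var.pred, names="CNR")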