library(fpp2)
library(vars)
library(knitr)
unilever_sales = read.csv("C:/ECON 4210/unilever_sales.csv")
df = unilever_sales
df = ts(df, start=2003, frequency=4)
y = df[,"SALES"]
y %>%
stl(t.window=13, s.window="periodic", robust=TRUE) %>%
autoplot()
tsdisplay(y)
Prior to evaluating the different forecasting models, it is necessary to examine the properties of the data. An STL decomposition (Loess-based) gives a clearer view of the trend, seasonality, and remainder components. The data appear non-stationary, as the values tend to increase over the years apart from a few anomalies, and the seasonal component is pronounced. The remainder stays relatively stable, with a few outliers such as the one just before 2005.
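To complement the visual read of the decomposition, the strength of the trend and seasonal components can be quantified. The following is a sketch, not part of the original output, using the strength measures from the FPP2 text and the forecast package's component accessors:
decomp <- stl(y, t.window=13, s.window="periodic", robust=TRUE)
Tt <- trendcycle(decomp)  # trend-cycle component
St <- seasonal(decomp)    # seasonal component
Rt <- remainder(decomp)   # remainder component
# FPP2 strength measures: values near 1 indicate a strong component
c(trend = max(0, 1 - var(Rt)/var(Tt + Rt)),
  seasonality = max(0, 1 - var(Rt)/var(St + Rt)))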
The ACF declines slowly and shows significant autocorrelation in the data, meaning Unilever's sales are influenced by their own past values. The partial ACF shows a few notable spikes within the first 5 lags that warrant further examination.
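The non-stationarity suggested by the plots can also be checked formally with a KPSS unit-root test. This is a sketch, not part of the original analysis; the urca package is not loaded above, so it is attached here:
library(urca)
summary(ur.kpss(y))  # a statistic above the critical value points to non-stationarity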
We will use data up until the end of 2015 for training and data from 2016 Q1 to 2019 Q2 for testing.
train <- window(y,start=c(2003, 1),end=c(2015, 4))
test <- window(y,start=2016)
both <- window(y,start=c(2003, 1))
h=length(test)
fit1 <- naive(train, h=h)
p1 = autoplot(fit1) + ylab("SALES")
fit2 <- snaive(train, h=h)
p2 = autoplot(fit2) + ylab("SALES")
tps = tslm(train ~ trend + season)
fit3 = forecast(tps, h=h)
p3 = autoplot(fit3) + ylab("SALES")
STL <- stl(train, t.window=15, s.window="periodic", robust=TRUE)
fit4 <- forecast(STL, method="rwdrift",h=h)
p4 = autoplot(fit4) + ylab("SALES")
ETS <- ets(train, model="ZZZ")
fit5 <- forecast(ETS, h=h)
p5 = autoplot(fit5) + ylab("SALES")
ndiffs(y)
## [1] 1
One difference is required to make the series stationary; auto.arima will apply this automatically when selecting a model.
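As a sanity check (a sketch, not in the original output), the differenced series should require no further differencing and should look stable:
ndiffs(diff(y))  # expected to return 0
autoplot(diff(y)) + ylab("Change in SALES")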
fit6 = forecast(auto.arima(train), h=h)
p6 = autoplot(fit6) + ylab("SALES")
fit7 = forecast(nnetar(train, lambda=0),PI=TRUE, h=h)
p7 = autoplot(fit7) + ylab("SALES")
gridExtra::grid.arrange(p1, p2, nrow=1)
gridExtra::grid.arrange(p3, p4, nrow=1)
gridExtra::grid.arrange(p5, p6, nrow=1)
gridExtra::grid.arrange(p7, nrow=1)
Most of the models capture the seasonality and trend in the data well, except for the naive method and NNETAR; these two do not appear to have captured the trend and seasonal components.
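Before comparing accuracy, the residuals of any of these fits can be inspected. A sketch using the ETS fit as an example; checkresiduals() also reports a Ljung-Box test:
checkresiduals(fit5)  # residual plot, ACF, histogram, and Ljung-Box test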
a1 = accuracy(fit1, test)
a2 = accuracy(fit2, test)
a3 = accuracy(fit3, test)
a4 = accuracy(fit4, test)
a5 = accuracy(fit5, test)
a6 = accuracy(fit6, test)
a7 = accuracy(fit7, test)
a.table<-rbind(a1, a2, a3, a4, a5, a6, a7)
row.names(a.table) = c('Naive training','Naive test','SNaive training','SNaive test','TSLM training','TSLM test','STL training','STL test','ETS training','ETS test','Auto Arima training','Auto Arima test','NNETAR training','NNETAR test')
a.table<-as.data.frame(a.table)
a.table<-a.table[order(a.table$MASE),]
a.table
## ME RMSE MAE MPE MAPE
## ETS test -1.440756e+02 443.0392 338.4432 -1.18901222 2.608628
## SNaive test -2.299286e+02 429.1045 358.5000 -1.81015959 2.761429
## Auto Arima training 1.740430e+02 597.4257 375.9920 1.61317194 3.590891
## STL training -2.674844e-13 608.8270 383.3204 -0.24715258 3.733277
## ETS training 6.992507e+01 588.3705 397.4980 0.38164598 3.799414
## NNETAR training 2.685263e+01 653.1317 454.2504 -0.19780934 4.199233
## TSLM test -6.902002e+01 568.3386 474.8073 -0.64001368 3.651071
## STL test -3.899278e+02 626.5788 509.5432 -3.09725408 3.961442
## NNETAR test 2.190661e+02 638.3111 521.3303 1.46310631 3.903082
## Naive test 2.292143e+02 657.3726 545.6429 1.53350639 4.085683
## Auto Arima test -6.421706e+02 844.1121 689.1401 -5.01724463 5.360577
## TSLM training -1.050036e-13 899.5068 699.2912 -0.67256281 6.370121
## SNaive training 1.218750e+02 984.3011 709.4167 0.53427371 6.527860
## Naive training 3.611765e+01 923.1701 716.1961 -0.09343404 6.683591
## MASE ACF1 Theil's U
## ETS test 0.4770725 0.45289735 0.5374758
## SNaive test 0.5053448 0.68242367 0.5304308
## Auto Arima training 0.5300016 -0.14158786 NA
## STL training 0.5403319 -0.30660304 NA
## ETS training 0.5603167 -0.03135721 NA
## NNETAR training 0.6403154 0.16840605 NA
## TSLM test 0.6692926 0.67545994 0.6978107
## STL test 0.7182566 0.58732776 0.7725885
## NNETAR test 0.7348718 0.12155403 0.7998820
## Naive test 0.7691430 0.07536873 0.8170365
## Auto Arima test 0.9714180 0.70981836 1.0514358
## TSLM training 0.9857270 0.72480530 NA
## SNaive training 1.0000000 0.60555810 NA
## Naive training 1.0095563 -0.17270665 NA
ETS performed the best, with a lower MAPE and MASE on the test set than the other models. Several other models are a decent fit to the data as well, such as seasonal naive, TSLM, and STL. NNETAR, the naive method, and auto.arima performed the worst in this analysis, with the largest MAPE and MASE. The neural network (NNETAR) was likely ineffective because the dataset is small and there is little non-linearity in the data. I expected auto.arima to perform better, since it searches over many candidate ARIMA specifications to find the best-fitting model.
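Given that ETS won on the test set, a natural follow-up (a sketch, not part of the original analysis; the 8-quarter horizon is an arbitrary choice) is to refit it on the full sample and forecast ahead:
final <- ets(both, model="ZZZ")                 # re-select an ETS model on all data
summary(final)                                  # shows the chosen specification
autoplot(forecast(final, h=8)) + ylab("SALES")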
CNR_data <- read.csv("C:/ECON 4210/as4_data.csv")
df = CNR_data
df = ts(df, start=2000, frequency=4)
y = df[,"CNR"]
tsdisplay(y)
Prior to analysing the forecasts for the data, it is necessary to examine the properties of the data. The first plot shows that revenue has increased gradually, with a few exceptions such as the 2008-2009 recession. The ACF decays slowly and extends well beyond the blue significance bounds, which means the data are strongly autocorrelated and influenced by previous values. The partial ACF shows a large spike at the first lag that we should keep in mind.
cor(df[,2:4])
## CNR epu cs
## CNR 1.0000000 0.1604483 0.1167658
## epu 0.1604483 1.0000000 -0.5998639
## cs 0.1167658 -0.5998639 1.0000000
The table shows the relationships between the three variables: Canadian National Railway revenue (CNR), US consumer sentiment (cs), and US economic policy uncertainty (epu). There is a moderate negative relationship between cs and epu (a correlation of about -0.60). The other two relationships are weak and positive. This will be kept in mind moving forward with the analysis.
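A scatterplot matrix (a sketch, not in the original output) provides a visual counterpart to the correlation table; the cs-epu panel should show the negative relationship:
pairs(df[,2:4], main="Pairwise relationships: CNR, epu, cs")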
vardata = log(df[, c(4, 3, 2)])
plot(vardata, main = "VAR data", xlab = "")
As can be seen from the graph, the data are non-stationary; rather than differencing, the VAR below includes constant and trend terms (type = "both") to account for this. The moderate negative relationship between cs and epu is clearly visible, as the two series tend to move in opposite directions. Beyond this, there are no other obvious patterns in the data.
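A quick check (a sketch, not part of the original output) of how many differences each series would need for stationarity, had we gone the differencing route:
sapply(colnames(vardata), function(v) ndiffs(vardata[, v]))  # differences needed per series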
VARselect(vardata, lag.max = 9, type = "both", season=4)
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 2 1 1 2
##
## $criteria
## 1 2 3 4
## AIC(n) -1.573679e+01 -1.580052e+01 -1.561701e+01 -1.542691e+01
## HQ(n) -1.542850e+01 -1.537662e+01 -1.507750e+01 -1.477178e+01
## SC(n) -1.495971e+01 -1.473204e+01 -1.425712e+01 -1.377561e+01
## FPE(n) 1.468800e-07 1.385142e-07 1.679024e-07 2.059033e-07
## 5 6 7 8
## AIC(n) -1.534359e+01 -1.528366e+01 -1.520204e+01 -1.517417e+01
## HQ(n) -1.457286e+01 -1.439732e+01 -1.420008e+01 -1.405661e+01
## SC(n) -1.340089e+01 -1.304956e+01 -1.267653e+01 -1.235726e+01
## FPE(n) 2.283760e-07 2.493967e-07 2.810094e-07 3.035711e-07
## 9
## AIC(n) -1.504246e+01
## HQ(n) -1.380928e+01
## SC(n) -1.193414e+01
## FPE(n) 3.690039e-07
The output above shows the lag length selected by each criterion in the vars package: SC and HQ select VAR(1), while AIC and FPE select VAR(2). Since the criteria disagree, we fit VAR(1) first.
var1 <- VAR(vardata, p=1, type="both", season=4)
serial.test(var1, lags.pt=16, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var1
## Chi-squared = 108.84, df = 135, p-value = 0.9523
var2 <- VAR(vardata, p=2, type="both", season=4)
serial.test(var2, lags.pt=16, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var2
## Chi-squared = 93.623, df = 126, p-value = 0.9862
The residuals for both models pass the test, since the p-values are greater than 0.05, indicating no significant remaining serial correlation.
var.1 = VAR(vardata, p=1, type = "both", season =4)
summary(var.1)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: cs, epu, CNR
## Deterministic variables: both
## Sample size: 77
## Log Likelihood: 295.034
## Roots of the characteristic polynomial:
## 0.905 0.8032 0.5827
## Call:
## VAR(y = vardata, p = 1, type = "both", season = 4L)
##
##
## Estimation results for equation cs:
## ===================================
## cs = cs.l1 + epu.l1 + CNR.l1 + const + trend + sd1 + sd2 + sd3
##
## Estimate Std. Error t value Pr(>|t|)
## cs.l1 0.9009361 0.0595690 15.124 <2e-16 ***
## epu.l1 -0.0072481 0.0363265 -0.200 0.842
## CNR.l1 -0.0149650 0.1261235 -0.119 0.906
## const 0.5636787 1.0116537 0.557 0.579
## trend 0.0005837 0.0017650 0.331 0.742
## sd1 0.0169272 0.0200933 0.842 0.402
## sd2 -0.0008554 0.0201108 -0.043 0.966
## sd3 -0.0197770 0.0200004 -0.989 0.326
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.06147 on 69 degrees of freedom
## Multiple R-Squared: 0.8399, Adjusted R-squared: 0.8237
## F-statistic: 51.72 on 7 and 69 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation epu:
## ====================================
## epu = cs.l1 + epu.l1 + CNR.l1 + const + trend + sd1 + sd2 + sd3
##
## Estimate Std. Error t value Pr(>|t|)
## cs.l1 -0.136720 0.150421 -0.909 0.3666
## epu.l1 0.747630 0.091730 8.150 1.06e-11 ***
## CNR.l1 -0.186615 0.318481 -0.586 0.5598
## const 3.102327 2.554577 1.214 0.2287
## trend 0.003250 0.004457 0.729 0.4683
## sd1 -0.044173 0.050739 -0.871 0.3870
## sd2 -0.105439 0.050783 -2.076 0.0416 *
## sd3 0.005206 0.050504 0.103 0.9182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.1552 on 69 degrees of freedom
## Multiple R-Squared: 0.7136, Adjusted R-squared: 0.6846
## F-statistic: 24.56 on 7 and 69 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation CNR:
## ====================================
## CNR = cs.l1 + epu.l1 + CNR.l1 + const + trend + sd1 + sd2 + sd3
##
## Estimate Std. Error t value Pr(>|t|)
## cs.l1 0.024442 0.039897 0.613 0.542133
## epu.l1 -0.051533 0.024330 -2.118 0.037773 *
## CNR.l1 0.642412 0.084472 7.605 1.05e-10 ***
## const 2.687518 0.677564 3.966 0.000176 ***
## trend 0.005126 0.001182 4.336 4.84e-05 ***
## sd1 -0.048924 0.013458 -3.635 0.000530 ***
## sd2 0.017281 0.013469 1.283 0.203783
## sd3 -0.019313 0.013395 -1.442 0.153898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.04117 on 69 degrees of freedom
## Multiple R-Squared: 0.9837, Adjusted R-squared: 0.9821
## F-statistic: 596.7 on 7 and 69 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## cs epu CNR
## cs 0.0037781 -0.0036034 -0.0002039
## epu -0.0036034 0.0240907 -0.0001099
## CNR -0.0002039 -0.0001099 0.0016948
##
## Correlation matrix of residuals:
## cs epu CNR
## cs 1.00000 -0.37771 -0.08057
## epu -0.37771 1.00000 -0.01719
## CNR -0.08057 -0.01719 1.00000
roots(var.1)
## [1] 0.9050434 0.8032001 0.5827345
We can look at the R² values to gauge how well the model fits each equation. The R² for cs is about 0.84 and for epu about 0.71, both reasonable fits; the R² for CNR is about 0.98, indicating the CNR equation fits very well.
All the roots of the characteristic polynomial are less than 1 in modulus, which implies the estimated VAR is stable.
plot(var.1, names = "cs")
plot(var.1, names = "epu")
plot(var.1, names = "CNR")
The plots show a good fit, as the fitted values (the dashed blue line) track most of the observed data closely.
acf(residuals(var.1), type="partial", lag.max=10)
Acf(residuals(var.1)[,"cs"], main="ACF of cs")
Acf(residuals(var.1)[,"epu"], main="ACF of epu")
Acf(residuals(var.1)[,"CNR"], main="ACF of CNR")
For the most part, the residuals of all three variables stay within the blue bounds, indicating little remaining autocorrelation. For cs there is some correlation around the 10th lag, but it does not cross the bounds; for epu there is some correlation near the 17th lag, again within the bounds; and the same holds for CNR in the first few lags.
serial.test(var.1, lags.pt = 16, type = "PT.adjusted")
##
## Portmanteau Test (adjusted)
##
## data: Residuals of VAR object var.1
## Chi-squared = 123, df = 135, p-value = 0.7618
Since the p-value of 0.76 is greater than 0.05, there is not enough evidence of residual autocorrelation, and the model is adequate.
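Two further residual checks from the vars package (a sketch, not part of the original output): a multivariate normality test and an ARCH test for remaining heteroskedasticity:
normality.test(var.1)            # Jarque-Bera-type test on the residuals
arch.test(var.1, lags.multi=5)   # multivariate ARCH-LM test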
causality(var.1, cause= c("cs", "epu" ))
## $Granger
##
## Granger causality H0: cs epu do not Granger-cause CNR
##
## data: VAR object var.1
## F-Test = 4.1334, df1 = 2, df2 = 207, p-value = 0.01737
##
##
## $Instant
##
## H0: No instantaneous causality between: cs epu and CNR
##
## data: VAR object var.1
## Chi-squared = 0.69719, df = 2, p-value = 0.7057
The first test produced a low p-value of 0.017, so we reject the null hypothesis that cs and epu do not Granger-cause CNR; together, cs and epu help predict CNR. The second test produced a high p-value of 0.71, so we cannot reject the null: there is no evidence of instantaneous causality between cs, epu, and CNR.
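The joint test does not reveal which variable drives the result; each candidate cause can be tested separately (a sketch, not in the original output):
causality(var.1, cause="cs")$Granger   # does cs alone Granger-cause the other two?
causality(var.1, cause="epu")$Granger  # does epu alone Granger-cause the other two?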
Impulse response (IR) analysis is an effective technique for performing what-if analysis (e.g. what happens to B if A increases by C percent?).
var1a.irf <- irf(var.1, n.ahead = 16, boot = TRUE, runs=500, seed=99, cumulative=FALSE)
par(mfrow=c(3,3))
plot(var1a.irf, plot.type = "single")
par(mfrow=c(2,2))
plot( irf(var.1, response = "CNR", n.ahead = 24, boot = TRUE, runs=500) , plot.type = "single")
par(mfrow=c(1,1))
The first graph shows an interesting relationship between cs and CNR: a shock to cs produces a noticeable impulse response in CNR, so cs has some influence.
The second graph shows that epu also influences CNR, producing a sizeable response; cs and epu appear to have opposite effects on CNR.
The third graph shows the response of CNR to its own shock, which is less informative for the cross-variable question here.
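A cumulative IRF (a sketch, not part of the original output) shows the accumulated effect of an epu shock on CNR over time, rather than the period-by-period response:
plot(irf(var.1, impulse="epu", response="CNR", n.ahead=24,
         cumulative=TRUE, boot=TRUE, runs=500, seed=99))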
fevd(var.1, n.ahead = 16)
## $cs
## cs epu CNR
## [1,] 1.0000000 0.0000000000 0.000000e+00
## [2,] 0.9997976 0.0001478543 5.451247e-05
## [3,] 0.9995115 0.0003648783 1.235719e-04
## [4,] 0.9992259 0.0005892430 1.848536e-04
## [5,] 0.9989700 0.0007967427 2.332436e-04
## [6,] 0.9987509 0.0009795881 2.695285e-04
## [7,] 0.9985670 0.0011369421 2.960663e-04
## [8,] 0.9984140 0.0012707033 3.152501e-04
## [9,] 0.9982873 0.0013836516 3.290567e-04
## [10,] 0.9981823 0.0014786610 3.389912e-04
## [11,] 0.9980955 0.0015583902 3.461558e-04
## [12,] 0.9980235 0.0016251846 3.513421e-04
## [13,] 0.9979638 0.0016810672 3.551137e-04
## [14,] 0.9979144 0.0017277639 3.578705e-04
## [15,] 0.9978734 0.0017667393 3.598963e-04
## [16,] 0.9978394 0.0017992324 3.613929e-04
##
## $epu
## cs epu CNR
## [1,] 0.1426629 0.8573371 0.000000000
## [2,] 0.1587004 0.8397784 0.001521264
## [3,] 0.1742164 0.8222125 0.003571109
## [4,] 0.1888546 0.8057224 0.005423030
## [5,] 0.2023709 0.7907804 0.006848679
## [6,] 0.2146225 0.7775287 0.007848743
## [7,] 0.2255539 0.7659413 0.008504816
## [8,] 0.2351778 0.7559116 0.008910616
## [9,] 0.2435562 0.7472978 0.009146003
## [10,] 0.2507822 0.7399469 0.009270958
## [11,] 0.2569660 0.7337066 0.009327411
## [12,] 0.2622239 0.7284328 0.009343302
## [13,] 0.2666705 0.7239930 0.009336569
## [14,] 0.2704141 0.7202675 0.009318358
## [15,] 0.2735541 0.7171505 0.009295356
## [16,] 0.2761796 0.7145490 0.009271406
##
## $CNR
## cs epu CNR
## [1,] 0.006491575 0.002645605 0.9908628
## [2,] 0.006770846 0.032922864 0.9603063
## [3,] 0.016289223 0.070770389 0.9129404
## [4,] 0.030733027 0.104474159 0.8647928
## [5,] 0.046790033 0.130146865 0.8230631
## [6,] 0.062562670 0.148082929 0.7893544
## [7,] 0.077141561 0.159937525 0.7629209
## [8,] 0.090172375 0.167440261 0.7423874
## [9,] 0.101585109 0.171988634 0.7264263
## [10,] 0.111447791 0.174600519 0.7139517
## [11,] 0.119890767 0.175978658 0.7041306
## [12,] 0.127067612 0.176592751 0.6963396
## [13,] 0.133134856 0.176749618 0.6901155
## [14,] 0.138241598 0.176645862 0.6851125
## [15,] 0.142524487 0.176405048 0.6810705
## [16,] 0.146105715 0.176103090 0.6777912
At most horizons epu explains a larger share of CNR's forecast error variance than cs, which suggests its influence on CNR is slightly stronger.
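The decomposition is easier to read graphically (a sketch, not in the original output); the plot method stacks the variance shares at each horizon:
plot(fevd(var.1, n.ahead = 16))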
var.fc = forecast(var.1, h=5)
autoplot(var.fc) + xlab("year")
The forecast of CNR captures the trend well, projecting continued growth in revenue. The cs forecast suggests consumer sentiment will remain stable. The epu forecast, however, shows a dip heading into 2020 and beyond, which is unsurprising given the volatile nature of economic policy uncertainty.
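Since the VAR was estimated on logged data, the CNR point forecasts can be back-transformed to revenue levels (a sketch, not part of the original output):
exp(var.fc$forecast$CNR$mean)  # point forecasts of CNR in levels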