Introduction
‘Time’ is one of the most important factors in the success of a business, and it is difficult to keep up with its pace. Technology, however, has given us some powerful methods to ‘see things’ ahead of time. Don’t worry, I am not talking about a time machine. Let’s be realistic here!
I’m talking about methods of prediction and forecasting. One such method, which deals with time-based data, is time series modeling. As the name suggests, it involves working with data indexed by time (years, days, hours, minutes) to derive hidden insights that support informed decision making.
Time series models are very useful when you have serially correlated data. Most businesses work with time series data to forecast sales for the coming year, analyze website traffic, track their competitive position and much more. However, it is also an area that many analysts do not understand well.
So, if you aren’t sure about the complete process of time series modeling, this guide will introduce you to the various stages of time series modeling and the related techniques.
The following code loads the data set and prints out a few top-level metrics.
library(readxl)
ubs_ts_G <- read_excel("~/ubs-produkAGO-TS.xlsx", sheet="ubs-produkG-TS")
library(astsa, quietly=TRUE, warn.conflicts=FALSE)
## Warning: package 'astsa' was built under R version 3.5.2
require(knitr)
## Loading required package: knitr
prod_G <- ts(ubs_ts_G$A_22K, frequency = 52, start = c(2014, 1))  # weekly series starting in week 1 of 2014
This tells you that the data series is in a time series format
class(prod_G)
## [1] "ts"
This is the start of the time series
start(prod_G)
## [1] 2014 1
This is the end of the time series
end(prod_G)
## [1] 2017 27
The cycle of this time series is 52 weeks in a year
frequency(prod_G)
## [1] 52
Summary of Time Series Data
summary(prod_G)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.946 2.050 4.713 5.610 7.257 48.244
The values of the series are spread across a wide range.
This will plot the time series
plot(prod_G)
This will fit a trend line through the series
plot(prod_G)
abline(reg=lm(prod_G~time(prod_G)))
Here are my observations:
There is a trend component: the product grows year by year.
There appears to be a seasonal component with a cycle of less than 48 weeks.
The variance of the data keeps increasing with time.
We know that we need to address two issues before we can test for stationarity. One, unequal variance, which is usually handled by taking the log of the series. Two, the trend component, which we remove by taking the difference of the series. Here we difference the series and test the resulting series.
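As a side note, a log transform is the usual way to remove unequal variance, but the summary above shows that this series contains values at or below zero, so a plain log would fail. The lines below are only a minimal sketch, assuming an illustrative offset of 1 to make the log defined; the analysis that follows sticks to the plain differenced series.
#A sketch only: shift the series so log() is defined, then difference to remove the trend
#The offset of 1 is an assumption for illustration, not part of the analysis below
prod_G_log_diff <- diff(log(prod_G + 1))
plot(prod_G_log_diff)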
library(tseries)
## Warning: package 'tseries' was built under R version 3.5.2
adf.test(diff(prod_G), alternative="stationary", k=0)
## Warning in adf.test(diff(prod_G), alternative = "stationary", k = 0): p-
## value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(prod_G)
## Dickey-Fuller = -21.315, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
With a p-value of 0.01 we can reject the null hypothesis of a unit root, so the differenced series is stationary enough for time series modelling.
The next step is to find the right parameters for the ARIMA model. We already know that the ‘d’ component is 1, as we needed one difference to make the series stationary. We find ‘p’ and ‘q’ using the correlation plots. Following is the ACF plot of the original series:
acf(prod_G)
What do you see in the chart shown above?
Clearly, the decay of the ACF is very slow, which means that the series is not stationary. As discussed above, we intend to work with the differenced series rather than the original one. Let’s see how the ACF and PACF curves come out for the differenced series.
#ACF and PACF plots of the differenced series
acf(diff(prod_G))
pacf(diff((prod_G)))
Clearly, the ACF plot cuts off after the first lag. Since it is the ACF (and not the PACF) that cuts off, the value of p should be 0, while the value of q should be 1 or 2. After a few iterations, we found that (0,1,1) as (p,d,q) comes out to be the combination with the lowest AIC and BIC.
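For reference, the lines below are a minimal sketch of those iterations: fit a small grid of candidate (p, 1, q) orders and rank them by AIC and BIC. The grid limits (p and q from 0 to 2) are an assumption chosen for illustration.
#Fit a small grid of candidate orders and rank them by AIC
orders <- expand.grid(p = 0:2, q = 0:2)
results <- data.frame()
for (i in seq_len(nrow(orders))) {
  fit_i <- tryCatch(arima(prod_G, order = c(orders$p[i], 1, orders$q[i])),
                    error = function(e) NULL)
  if (!is.null(fit_i)) {
    results <- rbind(results, data.frame(p = orders$p[i], d = 1, q = orders$q[i],
                                         AIC = AIC(fit_i), BIC = BIC(fit_i)))
  }
}
results[order(results$AIC), ]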
Let’s fit an ARIMA model and predict the next 48 weeks. Also, we will include a seasonal component in the ARIMA formulation. Then, we will visualize the prediction along with the training data. You can use the following code to do the same:
(fit <- arima(prod_G, c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 48)))
##
## Call:
## arima(x = prod_G, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 48))
##
## Coefficients:
## ma1 sma1
## -0.8988 -0.484
## s.e. 0.0413 0.150
##
## sigma^2 estimated as 36.58: log likelihood = -438.41, aic = 882.83
pred <- predict(fit, n.ahead = 48)
ts.plot(prod_G,pred$pred,log = "y", lty = c(1,3))
## Warning in xy.coords(x = matrix(rep.int(tx, k), ncol = k), y = x, log =
## log, : 5 y values <= 0 omitted from logarithmic plot
pred
## $pred
## Time Series:
## Start = c(2017, 28)
## End = c(2018, 23)
## Frequency = 52
## [1] 7.346728 3.833494 7.917632 4.386815 4.299840 4.896423 2.965208
## [8] 3.969512 1.880066 3.556556 5.032759 4.025050 5.229389 3.887650
## [15] 3.492819 9.157904 2.647430 4.601193 5.492713 4.845295 6.141357
## [22] 10.532024 10.276718 5.943699 5.648713 9.718768 7.730575 4.794835
## [29] 7.184079 11.140603 6.451480 10.041034 6.064921 8.843345 13.689631
## [36] 7.192515 10.393211 12.003248 14.484458 16.045549 5.728034 9.358081
## [43] 7.706149 9.715024 8.338808 6.151603 4.785264 5.202473
##
## $se
## Time Series:
## Start = c(2017, 28)
## End = c(2018, 23)
## Frequency = 52
## [1] 6.078106 6.109144 6.140026 6.170752 6.201327 6.231751 6.262028
## [8] 6.292159 6.322146 6.334436 6.363998 6.393424 6.422716 6.451874
## [15] 6.480901 6.509799 6.538569 6.567213 6.595732 6.624129 6.652405
## [22] 6.680560 6.708598 6.736519 6.764325 6.792017 6.819596 6.847065
## [29] 6.874423 6.901674 6.928817 6.955854 6.982786 7.009615 7.036342
## [36] 7.062967 7.089493 7.115919 7.142248 7.168480 7.194617 7.220659
## [43] 7.246607 7.272463 7.298227 7.323900 7.349484 7.374979
cycle(prod_G)
## Time Series:
## Start = c(2014, 1)
## End = c(2017, 27)
## Frequency = 52
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50 51 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [70] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## [93] 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5 6 7 8 9 10 11
## [116] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [139] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5
## [162] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
#cycle() above prints the position of each observation within the yearly cycle
#This will aggregate the cycles and display a year-on-year trend
plot(aggregate(prod_G, FUN = mean))
#A box plot across weeks will give us a sense of the seasonal effect
boxplot(prod_G ~ cycle(prod_G))
A seasonal time series has a seasonal component in addition to the trend and random components. Decomposing a seasonal time series means separating it into these three components:
prod_GComp <- decompose(prod_G)
plot(prod_GComp)
require(tseries)
require(forecast)
## Loading required package: forecast
##
## Attaching package: 'forecast'
## The following object is masked from 'package:astsa':
##
## gas
require(astsa)
dif <- diff(prod_G)                  # work with the differenced series
fit = auto.arima(dif)                # let auto.arima pick an order for the differenced series
pred = predict(fit, n.ahead = 48)    # forecast the next 48 weeks of differences
ts.plot(dif, pred$pred, lty = c(1,3), col = c(5,2))
prod_G_pred <- prod_G[length(prod_G)]   # start from the last observed level
for(i in 1:length(pred$pred)){
  prod_G_pred[i+1] <- prod_G_pred[i] + pred$pred[i]   # re-integrate the differences back to levels
}
plot(c(prod_G,prod_G_pred),type="l")
print(prod_G_pred)
## [1] 1.562540 7.539856 6.395167 6.203222 6.705155 6.516831 6.250167
## [8] 6.379898 6.158543 6.239613 6.201016 6.124979 5.960771 5.814241
## [15] 6.173789 6.424231 6.215045 6.637845 6.364344 6.144614 7.403649
## [22] 6.118046 6.295341 6.603425 6.364767 6.818434 7.909279 7.745537
## [29] 6.797765 5.906716 7.405399 6.716805 6.491469 7.039601 7.737940
## [36] 6.304313 7.724945 6.396153 7.522802 8.719060 7.117431 7.549318
## [43] 7.463314 8.271586 8.205949 6.361777 7.211781 6.774783 7.268018
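As a closing note, the re-integration loop above can also be written in a single vectorised step; a minimal sketch:
#Equivalent to the loop: the last observed level plus the cumulative sum of the
#forecast differences (the leading 0 keeps the starting level as the first element)
prod_G_pred_alt <- prod_G[length(prod_G)] + c(0, cumsum(pred$pred))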