IF you are planning to do some thing with time-seirse data and …

know something about R and machine learning,like dataframe,lm
get confused with ts, msts, etc(error_trend_seasonal) …
do not have time to read Springer textbooks

then read this, I will introduce:

Visualization mainly via ggplot2, what can we do and why
Time-seires objects manipulations, pitfalls and best practices
An easy tutorial of models mainly from great,great Rob Hyndman
Testing your models in a ‘backtest’ manner

By these, you can get a whole map of how to fit own your model to your own data, so let’s get start!

1 Data, packages and manipulation

packages

tidyverse : the most useful functoins in R
forecast : the most useful time-seires functions in R
fpp2 : enforced version of forecast
lubridate : after you have a date-type volumn in dataframe
zoo/xts: enforced data type of time-seires
tseries: used for arima model
TSstudio: very ambitious newe tool, but I use only for plotting

a glimpse of the rawdata

rawdata: a dataframe with 2 columns, date and value, we will mainly deal with, to train and to predict.

##       date                ivalue        
##  Min.   :2016-09-01   Min.   : 1185871  
##  1st Qu.:2017-07-12   1st Qu.: 3107670  
##  Median :2018-05-22   Median : 4418672  
##  Mean   :2018-05-22   Mean   : 5041120  
##  3rd Qu.:2019-04-01   3rd Qu.: 6565313  
##  Max.   :2020-02-09   Max.   :42395094

holiday_date: irregular events with date, can be used to improve performance of perdiction. Events like national day or lunar new year in China may have a great impact on the value. To be noticed, you should have this tags for both train and predict period, to aquire data , try to google ‘holiday date api’ .

head(holiday_date)

##           dt holiday
## 1 2020-01-01       2
## 2 2020-01-24       1
## 3 2020-01-25       2
## 4 2020-01-26       2
## 5 2020-01-27       2
## 6 2020-01-28       1

data manipulation

# extract all kinds of time tag, year, month, quarter ...
rawdata$year <- format( rawdata$date,"%Y")
rawdata$quarter <- quarters( rawdata$date, abbreviate = T )
rawdata$month <- format( rawdata$date, "%m")

# months of the year, days of the week etc
# yearmonth, month of the year . Eg: Nov 2018 
rawdata$yearmonth<- factor( as.yearmon( rawdata$date))
# days of the week 
rawdata$weekday <- weekdays( rawdata$date, abbreviate = T )
rawdata$weekday <- factor(rawdata$weekday , 
                        levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
# days of the month  
rawdata$monthday <- as.numeric( format(rawdata$date, "%d"))
# weeks of the year 
rawdata$yearweek <- as.numeric(format(rawdata$date,"%W")) 
# week of the month, normalizing the week to start at 1 for every month 
rawdata<-ddply(rawdata,.(yearmonth),transform,monthweek=1+yearweek-min(yearweek)) 

# add holiday as binary
holiday_date$dt <- as.Date( holiday_date$dt)

idata <- merge( rawdata, holiday_date ,
                      by.x ='date', by.y = 'dt', all.x = T )
idata$holiday <- ifelse( is.na(idata$holiday), 0,1 )

head(idata)

##         date  ivalue year quarter month yearmonth weekday monthday yearweek
## 1 2016-09-01 8260742 2016      Q3    09  Sep 2016     Thu        1       35
## 2 2016-09-02 8523553 2016      Q3    09  Sep 2016     Fri        2       35
## 3 2016-09-03 9035602 2016      Q3    09  Sep 2016     Sat        3       35
## 4 2016-09-04 7904068 2016      Q3    09  Sep 2016     Sun        4       35
## 5 2016-09-05 8158248 2016      Q3    09  Sep 2016     Mon        5       36
## 6 2016-09-06 6378481 2016      Q3    09  Sep 2016     Tue        6       36
##   monthweek holiday
## 1         1       0
## 2         1       0
## 3         1       0
## 4         1       0
## 5         2       0
## 6         2       0

# summary(idata)

2.visualization

By drawing some plots, we could get some sense about trend, seasonality and irregular volatility of the data and this will make modeling more effectively. Here is one of my principles in DS: no black magic, good models always make sense.

There are ways to generate plot from ts/xts objects directly, but simple using dataframe and ggplot2 is already good to go.

Three elements of plotting:
1.the numeric value
2.all kinds of time intervals( years, months, days in a week… )
3.organize them

Here are some examples :

2.1 data divided by intervals

# distribution 
ggplot( idata, aes( x=ivalue )) + 
  geom_density( ) + 
  geom_vline(aes(xintercept=median(ivalue)),
            color="blue", linetype="dashed", size=1)

# time serise divided by intervals( year, month , etc ) 
idata %>%
  filter( ivalue <= quantile(idata$ivalue, 0.975)  ) %>% 
    ggplot(aes(x=date, y=ivalue)) + 
    geom_jitter(alpha=0.5) +
    geom_smooth() +
    ggtitle("yearly pattern, outlier removed") + 
    facet_wrap( ~ year, scales = "free_x")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

idata %>% 
  filter( date >= as.Date('2018-01-01')& 
            date < as.Date('2020-01-01')) %>%
  ggplot( mapping = aes( x = monthday , y = ivalue)) + 
  geom_line() + 
  geom_point()  + 
  facet_grid(  year ~ month  )

2.2 data aggreated by intervals

# by year month
ggplot(idata ,  aes(yearmonth, ivalue)) +
  stat_summary(
    fun.y = sum, # adds up all observations for the month
    geom = "bar")

# by quarter 
idata %>% 
  filter( date >= as.Date('2018-01-01')& 
            date < as.Date('2020-01-01')) %>%
  ggplot( mapping = aes( x = monthday , y = ivalue)) + 
  stat_summary( fun.y = "mean", geom = 'line') + 
  facet_grid(  year ~ quarter  )

# weekday
idata %>% 
  filter( date >= as.Date('2018-01-01')& 
            date < as.Date('2020-01-01')) %>%
  ggplot() + 
  stat_summary( mapping = aes( x = as.numeric(weekday) , y = ivalue),
                fun.y = "median", geom = 'line')

2.3 more fancy plots

A combination of trend and seasonality, provided by TSstudio, a little usage of xts, other interesting plots, check

xts( idata$ivalue, order.by=idata$date ) %>% 
  .['2017/2019-12-31']  %>% 
  apply.monthly(. , sum) %>% 
  ts( ., start = c(2017,1,1), frequency = 12) %>% 
  ts_seasonal(., type = 'all')

# ts_heatmap(xdata)

very intersting heatmap by Sarang Gupta

idata %>% 
  filter( date >= as.Date('2018-01-01') & 
            date < as.Date('2020-01-01')) %>%
  ggplot( aes( monthweek, weekday , fill = ivalue)) + 
      scale_y_discrete( limits = rev( levels( idata$weekday))) +  # reverse workday
      geom_tile(colour = "white") + 
      facet_grid(year ~ month) + 
      scale_fill_gradient(low="red", high="green") + 
      xlab("Week of Month") + ylab("") + 
      ggtitle("Time-Series Calendar Heatmap:ivalue") + 
      labs(fill = "ivalue")

2.4 What we’ve learned?

As mentioned, trend, seasonality and volatility :

The trend has changed after 2017 and keep steady, so using data from 2018 and later should be enougth
Mulit-significant-seasonal-patterns, like different days in a week, different months in a year
Outliers are on the larger side(right skewed, positively tailed…), quite common with sales data(black fridays). So a log transformation or outlier manipulation might be helpful.

3.Data types,manipulatoin, investigation and preprocessing

The data infrastructure in r is dataframe, also in python and pyspark, this table-like data formation is wild-spread But when it comes the time series, you may be confused as I did about ts, xts, zoo objects with individuality, not easy to convert from/to each other and supported by different functions. Although the community is trying to change the situation, I have to say the learning curve is still steep. That is why I have decided to write this guild.

So I will give a brief about the data types and then provide an investigation on time-seires decompose, meanwhile, prepare the train and test set for modeling.

3.1 stats::ts

It’s a build-in class in r, basically is a vector with a time index:

not friendly for manipulation
- only ONE vector but not matrix, cannot handle data like OHLC
- subsetting by index or date are not supported, but forecast::subset is helpful
- other discussions aboutseconds-minutes and starting date
the mose important and puzzled parameter: frequency
- how many observations in one time interval,ep,days in a year 365
- can be 365.25 if you have a leap year
- only ONE mumber, so an annual data cannot have frequency represents 365 days in a year and 7 days in a week
a paradox of long term ts object(say, longer than 2 years):
- if freq=365.25/365.5, too long for stl to decompose it, this will effect ets,autoarima,and other functions
- if change to 7, all plot labels and index will go wrong
suggestion: use ts as a process variable when needed

ind_train <- seq(as.Date("2018-01-01"), as.Date("2020-01-05"), by = "day")
ind_test <- seq(as.Date("2020-01-06"), as.Date("2020-01-31"), by = "day")

train_ts <- ts( idata$ivalue[ idata$date >= min(ind_train) & 
                               idata$date <= max(ind_train)], 
                start = c( year(ind_train[1]), as.numeric(format(ind_train[1], "%j"))),
                           frequency = 365.5
                )

test_ts <- ts( idata$ivalue[ idata$date >= min(ind_test) & 
                               idata$date <= max(ind_test)], 
                start = c( year(ind_test[1]), as.numeric(format(ind_test[1], "%j"))),
                           frequency = 365.5
               )
ind_train[1:5]

## [1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" "2018-01-05"

time(train_ts)[1:5] # the index is not in a date format

## [1] 2018.000 2018.003 2018.005 2018.008 2018.011

To decomposing a ts objects:

function decompose is deprecated, may got errorslike this
forecast::stl(Seasonal Decomposition of Time Series by LOESS), due to limitation of ts object(only have one frequency), it can only decompose seasaonlity into one factor. Obviously not good for our complicated data.
The bast practice is Prof Hyndman’s msts object

3.2 xts

structure: index + matrix, subclass of zoo
easy for manipulation see, and plot
not supported by many following-up functions
suggestion: use it for preprocessing

train_xts <- xts(train_ts, order.by = ind_train )
# xtsdata[1:5]
index(train_xts)[1:5]

## [1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" "2018-01-05"

to.period(train_xts,period="years")# convert to OHLC

##            train_xts.Open train_xts.High train_xts.Low train_xts.Close
## 2018-12-31        7294302       12969003       1604512         4215499
## 2019-12-31        2737975       12516036       1185871         2680282
## 2020-01-05        2050793        4947452       1652882         4121744

train_xts["2019-01-30/2019-02-01"]  # subset

##                [,1]
## 2019-01-30  6453745
## 2019-01-31  6846035
## 2019-02-01 11917137

test_xts <- xts( test_ts, order.by = ind_test )

autoplot(cbind(train_xts, test_xts), facet = NULL)

3.3 forecast::msts

Multi-Seasonal Time Series type,see
decompose: use mstl() by define seasonal.periods in msts().
it is important to notice the vertical scales of different seaonality, for this data, the weekly, monthly and annually fluctuation are all important

istart <- c( year(ind_train[1]), as.numeric(format(ind_train[1], "%j")))

msts( as.numeric(train_xts), 
          start = istart,
          seasonal.periods = c(7,365.25) ) %>% 
  mstl(., lambda = 'auto') %>% autoplot()

msts( as.numeric(train_xts), 
          start = istart,
          seasonal.periods = c(7,30.5, 365.25) ) %>% 
  mstl(., lambda = 'auto') %>% autoplot()

train_msts <- msts( train_ts , 
                    start = c( year(ind_train[1]), as.numeric(format(ind_train[1], "%j"))),
                    seasonal.periods = c(7, 30.5, 365.25) )

3.4 dummy variables

One way to better fit is to use xreg, which is supported widely, to represent irregular black fridays. As mentioned, we have to prepare these dummy variables for both train and test.

train_holiday <- idata$holiday[ 
  idata$date >= min(ind_train) & idata$date <= max(ind_train)]

test_holiday <- idata$holiday[
  idata$date >= min(ind_test) & idata$date <= max(ind_test) ]

4.Modeling

Has been through all these, we are now ready to step into the most exciting part, here are 3 class of model: ets, arima and tbats.

For each, goes with a short explaination about the method, a section of coding and some additional remarks about the output. Again, I learned these from Forecasting:Principles and Practice, if you have time, read it.

4.1 ets: exponential models

the exponential model evolvement

Started from SES(Simple expoential smoothing): \(\hat{y_{t+1}} = ay_t + a(1-a)y_{t-1}+a(1-a)^2y_{t-2}...\)
Added a trend: Holt’s Liner/Damped with Trend
- \(\hat{y_{t+1}} = l_t + h*b_t\), l-level( just as a SES) + b-trend
- if the trend is linear: h=constant
- if the trend is damped(bended): \(h = \phi+\phi^2+...+\phi^h\)
Added a seasonality: Holt-Winters’s seasonal method
- l-level, t-trend, s-seasonal
- additive seaonal: l+b+s
- multiplicative seaonal : (l+b)*s
Exponential smoothing methods overview,ref
- Error{A,M},Trend{N,A,Ad},Seasonal{N,A,M}
- A: additive, Ad: additive damped, M:multiplicative
- for example, SES = Error{A},Trend{NULL},Seasonal{NULL}
- these methods are refered as State space models and in forecast as ETS

coding

# ?? forecast::ets() 
ets1_fc <- train_ts %>% 
  ets(.,damped =T, allow.multiplicative.trend = T  ) %>% 
  forecast( ., h = length(test_ts) )

## Warning in ets(., damped = T, allow.multiplicative.trend = T): I can't handle
## data with frequency greater than 24. Seasonality will be ignored. Try stlf() if
## you need seasonal forecasts.

checkresiduals( ets1_fc) # about

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,N)
## Q* = 256.06, df = 142, p-value = 1.464e-08
## 
## Model df: 5.   Total lags used: 147

test_forecast(forecast.obj = ets1_fc , 
              actual = rbind( train_xts, test_xts),  
              test = test_xts)

accuracy(ets1_fc, test_ts )

##                       ME    RMSE       MAE       MPE     MAPE      MASE
## Training set   -2752.191 1647855 1071050.8 -10.25152 27.31593 0.7107637
## Test set     -377599.900 1439462  989069.2 -20.40694 28.49301 0.6563596
##                   ACF1 Theil's U
## Training set 0.1154781        NA
## Test set     0.2362324  1.007458

# box-cox transformation 

ets2_fc <- train_ts %>% 
  ets(.,damped =T, allow.multiplicative.trend = T, lambda = 0 ) %>% 
  forecast( ., h = length(test_ts) )

## Warning in ets(., damped = T, allow.multiplicative.trend = T, lambda = 0): I
## can't handle data with frequency greater than 24. Seasonality will be ignored.
## Try stlf() if you need seasonal forecasts.

accuracy(ets2_fc, test_ts )

##                     ME    RMSE     MAE       MPE     MAPE      MASE       ACF1
## Training set  140657.5 1634750 1006885  -5.58690 24.57455 0.6681827 0.08292693
## Test set     -437356.1 1456232 1029709 -22.23896 29.96143 0.6833290 0.23616507
##              Theil's U
## Training set        NA
## Test set       1.02463

about Box-Cox transformation or parameter ‘lambda’,‘biasadj’

train function parameter:lambda,NULL(default),‘auto’ or 0(log)
biasadj True = mean, False = median, if aggregation is needed, then set True
see

about checkresiduals()

criteria : should be normally distributed with mean zero & constant variance

about accuracy()

ME,RMSE…
- M-mean, R-root,S-Square, A-absolute,P-Percentage,E-error
- which to use
ACF1: Autocorrelation of errors at lag 1
Theil’s U on testset, sth like \(\frac{\hat{y_{t+1}} - y_{t+1}}{y_t}\) divided by \(\frac{y_{t+1} - y_t}{y_t}\), should be less than 1,see

4.2 arirma

basic method( since there is auto.arima, I don’t want to talk much about it)

check the stationary, acf and pacf to determine parameters
seasonal is like ARIMA(p,d,q)(P,D,Q)m
auto.arima save the world
- based on AIC, discussion
- seasnonal determined by nsdiffs, but also, can only deal with one seasnality

coding

## auto.arima application
arima1_fc <- train_ts %>% 
  auto.arima( . , max.p = 3, max.q = 2, max.d = 1) %>% 
  forecast( . , h = length(test_ts))  

# autoplot(arima1_fc ) 
test_forecast(forecast.obj = arima1_fc , 
              actual = rbind( train_xts, test_xts),  
              test = test_xts)

checkresiduals( arima1_fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)
## Q* = 296.5, df = 145, p-value = 1.843e-12
## 
## Model df: 2.   Total lags used: 147

accuracy(arima1_fc, test_ts )

##                     ME    RMSE       MAE        MPE     MAPE      MASE
## Training set -105875.0 1558318 1068148.2 -16.231779 30.54475 0.7088375
## Test set      200350.5 1403456  748522.1  -2.684243 18.49179 0.4967294
##                    ACF1 Theil's U
## Training set 0.05782062        NA
## Test set     0.23626598 0.9331335

## auto.arima with xreg and log transformation

arima2_fc <- train_ts %>%
  auto.arima(., max.p = 3, max.q = 2, max.d = 1, xreg = train_holiday, lambda = 0) %>% 
  forecast(., xreg = test_holiday,  h = length(test_ts)) 
accuracy(arima2_fc, test_ts )

##                    ME    RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 154797.2 1570016 992729.5 -7.999977 26.14839 0.6587887 0.07341816
## Test set     501398.0 1479477 779229.0  6.535402 17.75847 0.5171068 0.23841939
##              Theil's U
## Training set        NA
## Test set     0.9683756

test_forecast(forecast.obj = arima2_fc , 
              actual = rbind( train_xts, test_xts),  
              test = test_xts)

Regression with ARIMA errors

why bother?
- arima is not good for multi-seasonality ( can also be solved by msts->tbats)
- non-period events such as Easter or the Chinese New Year
regression model with ARIMA errorsmore :
- holiday effects : dummy xreg for train and test
- longer seasonality : Fourier transformation
- short-term dynamics handled by an ARMA
about fourier( ts, K, h)
- K, the number of Fourier sin and cos pairs
- more

coding : Regression with ARIMA errors

## Regression with ARIMA errors 

# determin the best K for fourier
Theil_u <- vector()
for ( i in seq(6)) { 
  z <- fourier( train_ts, K=i)
  zf <- fourier( test_ts, K=i,
                 h=length(test_ts) )
  regarima <- auto.arima( train_ts , seasonal =F,
                          xreg=cbind(z,hol=train_holiday) )
  regarima_fc <- forecast(regarima, 
                          xreg=cbind(zf,hol=test_holiday), 
                          h= length(test_ts) )
  Theil_u[i] <- accuracy(regarima_fc, test_ts )[2,8]
}
bestK <- which.min(Theil_u)
rm(z, zf, regarima, regarima_fc,i,Theil_u )

# or by AIC
# bestfit <- list(aicc=Inf)
# for(K in seq(25)) {
#   fit <- auto.arima(train_ts,
#                     xreg=cbind(z,hol=train_holiday),
#                     seasonal=FALSE)
#   if(fit[["aicc"]] < bestfit[["aicc"]]) {
#     bestfit <- fit
#     bestK <- K
#   }
# }
# bestK

z <- fourier( train_ts, K= bestK ) 
zf <- fourier( test_ts, K= bestK , h=length(test_ts) )
arima3_fc <- train_ts %>% 
  auto.arima( . , xreg=cbind(z,hol=train_holiday), seasonal=F, lambda = 0) %>% 
  forecast(., xreg=cbind(zf,hol=test_holiday), h= length(test_ts) )
accuracy(arima3_fc, test_ts )

##                     ME    RMSE       MAE        MPE     MAPE      MASE
## Training set  136455.5 1634804 1005262.3  -5.679592 24.56103 0.6671056
## Test set     -103138.5 1389068  806554.5 -11.962152 21.94878 0.5352404
##                    ACF1 Theil's U
## Training set 0.08498273        NA
## Test set     0.23645116 0.9489268

checkresiduals( arima3_fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(0,1,1) errors
## Q* = 274.4, df = 136, p-value = 2.194e-11
## 
## Model df: 11.   Total lags used: 147

test_forecast(forecast.obj = arima3_fc , 
              actual = rbind( train_xts, test_xts),  
              test = test_xts)

4.3 tbats

method

daily time series longer than a year, contain annual/weekly seasonality
Complexity of data
- multiple seasonal periods
- non-integer seasonality(leap year)
- dual-calendar effects: the lunar calendar
TBATS framework(Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components)
- Box-Cox transformations: normality
- Fourier series with time varying coefficients
- ARMA error correction

coding

## simple 
bat1 <- tbats( train_msts)
# print(bat1)
bat1_fc <- forecast(bat1, h = length(test_ts) )
accuracy(bat1_fc, test_ts )

##                    ME    RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 115591.3 1461810 876010.0 -6.714077 22.44940 0.5597782 -0.0124778
## Test set     229561.5 1547009 904024.2 -2.691073 23.43636 0.5776795  0.2485225
##              Theil's U
## Training set        NA
## Test set      1.038621

## with cox-box transformation and xreg
bat2_fc <- train_msts %>%
  tbats(., lambda=0,  xreg = train_holiday ) %>% 
  forecast( ., h = length(test_ts) )
accuracy(bat2_fc, test_ts )

##                       ME    RMSE      MAE       MPE     MAPE      MASE
## Training set 114892.0995 1466989 878695.4 -6.715136 22.47679 0.5614942
## Test set        508.6969 1498184 906200.9 -9.716975 24.72468 0.5790704
##                    ACF1 Theil's U
## Training set -0.0334446        NA
## Test set      0.2155837  1.018188

checkresiduals(bat2_fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from TBATS(0.064, {1,0}, -, {<7,3>, <30.5,6>, <365.25,3>})
## Q* = 180.44, df = 112, p-value = 4.44e-05
## 
## Model df: 35.   Total lags used: 147

test_forecast(forecast.obj = bat2_fc , 
              actual = rbind( train_xts, test_xts),  
              test = test_xts)

4.4 pipelines for different modeling

ets: ts -> forecast::ets -> test:forecast,accuracy,checkresiduals
arima(auto.arima): ts/vector -> tseries::adf.test -> auto.arima, autoplot(AR roots) -> test:fore,acc,checkres
tbats(multiple seasonal state space model): forecast::msts -> tbats -> test…
regression with arima errors:ts + forecast::fourier + xreg -> auto.arima -> test…

4.5 other methods and issues

tslm,Time Series Linear Model, didn’t work well for me
backtesting using TSstudio

5.comparisoin

Yes, we have used the accuracy on all the models. But are the sample representing well enough? Since we have long-term data, it’s better to test through all history and get an average performance of each model, or so called backtesting.

Here is an example, with ugly coding, though.

5.1 data preperation

inds <- seq( min(idata$date), max(idata$date), by = "day")
tsdata <- ts( idata$ivalue,  
            start = c( year(inds[1]), as.numeric(format(inds[1], "%j"))),
           frequency = 365.25 )
ts_info(tsdata)

##  The tsdata series is a ts object with 1 variable and 1257 observations
##  Frequency: 365.25 
##  Start time: 2016.66803559206 
##  End time: 2020.1067761807

mstsdata <- msts(tsdata, seasonal.periods=c(7, 365.25)) #

5.2 backtesting loop function

tr = 100 # length of train set 
te = 20 # length of test and predict set
step = 30 # the window between two train sets
k = bestK # best K for fourier transformation 

ts_model_backtesting <- function( tr, te, step, k = bestK   ) {
   
  icyc <- floor((  length(tsdata)- tr-te)/step)

  perf <- data.frame()
  
  for ( i in 1: icyc ) {
    
    # data preprocessing 
    tstrain <- subset( tsdata, start = step*i, end = (step*i+tr-1)  ) 
    tstest <- subset( tsdata, start = (step*i+tr) , end = (step*i+tr+te-1) )
    tagtran <- idata$holiday [ (step*i) : (step*i+tr-1)]
    tagtest <- idata$holiday [ (step*i+tr): (step*i+tr+te-1)]
    
    z <- fourier( tstrain, K= bestK )
    zf <- fourier( tstest, K= bestK )
    
    # ets 
    ssets <- ets( tstrain
                  , damped =T
                  , allow.multiplicative.trend = T
                  , lambda = 0
                  )
    fets <- forecast( ssets, h= te )%>% 
        accuracy(., tagtest) %>% .[2,2]
    
    # arima 
    sarima <- try( auto.arima( 
      tstrain , lambda = 0, seasonal=TRUE ,xreg=cbind(z,hol=tagtran)))
                      
    farima <- if( class(sarima)[1] !='try-error'  ) {
      forecast(sarima,xreg=cbind(zf,hol=tagtest), h= te,biasadj =F ) %>% 
        accuracy(., tagtest ) %>% .[2,2]
    } else {NA}
    
    #tbats
    sbats <- try( tbats( 
      subset(mstsdata, start = (step*i), end = (step*i+tr-1) ),
      xreg = tagtran, lambda =0 ) )
    
    fbats <- if( class(sbats)[1] !='try-error'  ) {
      forecast(sbats, xreg= tagtest, h= te ) %>% 
        accuracy(., tagtest ) %>% .[2,2]
    } else {NA}
    
    a <- data.frame(
      round = i , 
      rmse_ets = fets, 
      rmse_arima =  farima, 
      rmse_bats =   fbats )
    
    perf <- rbind(a, perf )
  }

  return( perf)
}

5.3 the performance of models with a short-term trainning and a long term trainning

perf_quarter  <- ts_model_backtesting( tr =100, te =20, step =30 )

## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) : 
##   No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) : 
##   No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) : 
##   No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) : 
##   No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) : 
##   No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) : 
##   No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) : 
##   No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) : 
##   No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) : 
##   No suitable ARIMA model found

summary(perf_quarter)

##      round       rmse_ets         rmse_arima        rmse_bats      
##  Min.   : 1   Min.   :1780194   Min.   :2618949   Min.   :1940536  
##  1st Qu.:10   1st Qu.:2761889   1st Qu.:3637561   1st Qu.:3214171  
##  Median :19   Median :3821193   Median :4237082   Median :3749071  
##  Mean   :19   Mean   :4336923   Mean   :5077623   Mean   :4343436  
##  3rd Qu.:28   3rd Qu.:5805635   3rd Qu.:6639816   3rd Qu.:5266380  
##  Max.   :37   Max.   :9436381   Max.   :8993253   Max.   :9268679  
##                                                   NA's   :9

perf_annuel  <- ts_model_backtesting( tr =380, te =20, step =30 )
summary(perf_annuel)

##      round          rmse_ets         rmse_arima        rmse_bats      
##  Min.   : 1.00   Min.   :1947776   Min.   :2833904   Min.   :2552970  
##  1st Qu.: 7.75   1st Qu.:3428083   1st Qu.:3564460   1st Qu.:3559865  
##  Median :14.50   Median :3827823   Median :3907917   Median :3682509  
##  Mean   :14.50   Mean   :3954214   Mean   :4178775   Mean   :3804682  
##  3rd Qu.:21.25   3rd Qu.:4415535   3rd Qu.:4718107   3rd Qu.:3942935  
##  Max.   :28.00   Max.   :7137421   Max.   :6744029   Max.   :5808713

ref

https://uc-r.github.io/ts_exploration
https://otexts.com/fpp2/
https://robjhyndman.com/publications/complex-seasonality/
https://towardsdatascience.com/time-series-calendar-heatmaps-9f576578fcfe
https://robjhyndman.com/hyndsight/dailydata/
https://robjhyndman.com/hyndsight/forecast7-part-2/

Time Series Forecasting, a Summarize of Data visualization and prediction

Ted Ding

2020/2/22

IF you are planning to do some thing with time-seirse data and …

1 Data, packages and manipulation

packages

a glimpse of the rawdata

data manipulation

2.visualization

2.1 data divided by intervals

2.2 data aggreated by intervals

2.3 more fancy plots

2.4 What we’ve learned?

3.Data types,manipulatoin, investigation and preprocessing

3.1 stats::ts

3.2 xts

3.3 forecast::msts

3.4 dummy variables

4.Modeling

4.1 ets: exponential models

the exponential model evolvement

coding

about Box-Cox transformation or parameter ‘lambda’,‘biasadj’

about checkresiduals()

about accuracy()

4.2 arirma

basic method( since there is auto.arima, I don’t want to talk much about it)

coding

Regression with ARIMA errors

coding : Regression with ARIMA errors

4.3 tbats

method

coding

4.4 pipelines for different modeling

4.5 other methods and issues

5.comparisoin

5.1 data preperation

5.2 backtesting loop function

5.3 the performance of models with a short-term trainning and a long term trainning

ref