then read this, I will introduce:
By these, you can get a whole map of how to fit own your model to your own data, so let’s get start!
rawdata: a dataframe with 2 columns, date and value, we will mainly deal with, to train and to predict.
## date ivalue
## Min. :2016-09-01 Min. : 1185871
## 1st Qu.:2017-07-12 1st Qu.: 3107670
## Median :2018-05-22 Median : 4418672
## Mean :2018-05-22 Mean : 5041120
## 3rd Qu.:2019-04-01 3rd Qu.: 6565313
## Max. :2020-02-09 Max. :42395094
holiday_date: irregular events with date, can be used to improve performance of perdiction. Events like national day or lunar new year in China may have a great impact on the value. To be noticed, you should have this tags for both train and predict period, to aquire data , try to google ‘holiday date api’ .
head(holiday_date)
## dt holiday
## 1 2020-01-01 2
## 2 2020-01-24 1
## 3 2020-01-25 2
## 4 2020-01-26 2
## 5 2020-01-27 2
## 6 2020-01-28 1
# extract all kinds of time tag, year, month, quarter ...
rawdata$year <- format( rawdata$date,"%Y")
rawdata$quarter <- quarters( rawdata$date, abbreviate = T )
rawdata$month <- format( rawdata$date, "%m")
# months of the year, days of the week etc
# yearmonth, month of the year . Eg: Nov 2018
rawdata$yearmonth<- factor( as.yearmon( rawdata$date))
# days of the week
rawdata$weekday <- weekdays( rawdata$date, abbreviate = T )
rawdata$weekday <- factor(rawdata$weekday ,
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
# days of the month
rawdata$monthday <- as.numeric( format(rawdata$date, "%d"))
# weeks of the year
rawdata$yearweek <- as.numeric(format(rawdata$date,"%W"))
# week of the month, normalizing the week to start at 1 for every month
rawdata<-ddply(rawdata,.(yearmonth),transform,monthweek=1+yearweek-min(yearweek))
# add holiday as binary
holiday_date$dt <- as.Date( holiday_date$dt)
idata <- merge( rawdata, holiday_date ,
by.x ='date', by.y = 'dt', all.x = T )
idata$holiday <- ifelse( is.na(idata$holiday), 0,1 )
head(idata)
## date ivalue year quarter month yearmonth weekday monthday yearweek
## 1 2016-09-01 8260742 2016 Q3 09 Sep 2016 Thu 1 35
## 2 2016-09-02 8523553 2016 Q3 09 Sep 2016 Fri 2 35
## 3 2016-09-03 9035602 2016 Q3 09 Sep 2016 Sat 3 35
## 4 2016-09-04 7904068 2016 Q3 09 Sep 2016 Sun 4 35
## 5 2016-09-05 8158248 2016 Q3 09 Sep 2016 Mon 5 36
## 6 2016-09-06 6378481 2016 Q3 09 Sep 2016 Tue 6 36
## monthweek holiday
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 2 0
## 6 2 0
# summary(idata)
By drawing some plots, we could get some sense about trend, seasonality and irregular volatility of the data and this will make modeling more effectively. Here is one of my principles in DS: no black magic, good models always make sense.
There are ways to generate plot from ts/xts objects directly, but simple using dataframe and ggplot2 is already good to go.
Three elements of plotting:
1.the numeric value
2.all kinds of time intervals( years, months, days in a week… )
3.organize them
Here are some examples :
# distribution
ggplot( idata, aes( x=ivalue )) +
geom_density( ) +
geom_vline(aes(xintercept=median(ivalue)),
color="blue", linetype="dashed", size=1)
# time serise divided by intervals( year, month , etc )
idata %>%
filter( ivalue <= quantile(idata$ivalue, 0.975) ) %>%
ggplot(aes(x=date, y=ivalue)) +
geom_jitter(alpha=0.5) +
geom_smooth() +
ggtitle("yearly pattern, outlier removed") +
facet_wrap( ~ year, scales = "free_x")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
idata %>%
filter( date >= as.Date('2018-01-01')&
date < as.Date('2020-01-01')) %>%
ggplot( mapping = aes( x = monthday , y = ivalue)) +
geom_line() +
geom_point() +
facet_grid( year ~ month )
# by year month
ggplot(idata , aes(yearmonth, ivalue)) +
stat_summary(
fun.y = sum, # adds up all observations for the month
geom = "bar")
# by quarter
idata %>%
filter( date >= as.Date('2018-01-01')&
date < as.Date('2020-01-01')) %>%
ggplot( mapping = aes( x = monthday , y = ivalue)) +
stat_summary( fun.y = "mean", geom = 'line') +
facet_grid( year ~ quarter )
# weekday
idata %>%
filter( date >= as.Date('2018-01-01')&
date < as.Date('2020-01-01')) %>%
ggplot() +
stat_summary( mapping = aes( x = as.numeric(weekday) , y = ivalue),
fun.y = "median", geom = 'line')
A combination of trend and seasonality, provided by TSstudio, a little usage of xts, other interesting plots, check
xts( idata$ivalue, order.by=idata$date ) %>%
.['2017/2019-12-31'] %>%
apply.monthly(. , sum) %>%
ts( ., start = c(2017,1,1), frequency = 12) %>%
ts_seasonal(., type = 'all')
# ts_heatmap(xdata)
very intersting heatmap by Sarang Gupta
idata %>%
filter( date >= as.Date('2018-01-01') &
date < as.Date('2020-01-01')) %>%
ggplot( aes( monthweek, weekday , fill = ivalue)) +
scale_y_discrete( limits = rev( levels( idata$weekday))) + # reverse workday
geom_tile(colour = "white") +
facet_grid(year ~ month) +
scale_fill_gradient(low="red", high="green") +
xlab("Week of Month") + ylab("") +
ggtitle("Time-Series Calendar Heatmap:ivalue") +
labs(fill = "ivalue")
As mentioned, trend, seasonality and volatility :
The data infrastructure in r is dataframe, also in python and pyspark, this table-like data formation is wild-spread But when it comes the time series, you may be confused as I did about ts, xts, zoo objects with individuality, not easy to convert from/to each other and supported by different functions. Although the community is trying to change the situation, I have to say the learning curve is still steep. That is why I have decided to write this guild.
So I will give a brief about the data types and then provide an investigation on time-seires decompose, meanwhile, prepare the train and test set for modeling.
It’s a build-in class in r, basically is a vector with a time index:
ind_train <- seq(as.Date("2018-01-01"), as.Date("2020-01-05"), by = "day")
ind_test <- seq(as.Date("2020-01-06"), as.Date("2020-01-31"), by = "day")
train_ts <- ts( idata$ivalue[ idata$date >= min(ind_train) &
idata$date <= max(ind_train)],
start = c( year(ind_train[1]), as.numeric(format(ind_train[1], "%j"))),
frequency = 365.5
)
test_ts <- ts( idata$ivalue[ idata$date >= min(ind_test) &
idata$date <= max(ind_test)],
start = c( year(ind_test[1]), as.numeric(format(ind_test[1], "%j"))),
frequency = 365.5
)
ind_train[1:5]
## [1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" "2018-01-05"
time(train_ts)[1:5] # the index is not in a date format
## [1] 2018.000 2018.003 2018.005 2018.008 2018.011
To decomposing a ts objects:
train_xts <- xts(train_ts, order.by = ind_train )
# xtsdata[1:5]
index(train_xts)[1:5]
## [1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" "2018-01-05"
to.period(train_xts,period="years")# convert to OHLC
## train_xts.Open train_xts.High train_xts.Low train_xts.Close
## 2018-12-31 7294302 12969003 1604512 4215499
## 2019-12-31 2737975 12516036 1185871 2680282
## 2020-01-05 2050793 4947452 1652882 4121744
train_xts["2019-01-30/2019-02-01"] # subset
## [,1]
## 2019-01-30 6453745
## 2019-01-31 6846035
## 2019-02-01 11917137
test_xts <- xts( test_ts, order.by = ind_test )
autoplot(cbind(train_xts, test_xts), facet = NULL)
istart <- c( year(ind_train[1]), as.numeric(format(ind_train[1], "%j")))
msts( as.numeric(train_xts),
start = istart,
seasonal.periods = c(7,365.25) ) %>%
mstl(., lambda = 'auto') %>% autoplot()
msts( as.numeric(train_xts),
start = istart,
seasonal.periods = c(7,30.5, 365.25) ) %>%
mstl(., lambda = 'auto') %>% autoplot()
train_msts <- msts( train_ts ,
start = c( year(ind_train[1]), as.numeric(format(ind_train[1], "%j"))),
seasonal.periods = c(7, 30.5, 365.25) )
One way to better fit is to use xreg, which is supported widely, to represent irregular black fridays. As mentioned, we have to prepare these dummy variables for both train and test.
train_holiday <- idata$holiday[
idata$date >= min(ind_train) & idata$date <= max(ind_train)]
test_holiday <- idata$holiday[
idata$date >= min(ind_test) & idata$date <= max(ind_test) ]
Has been through all these, we are now ready to step into the most exciting part, here are 3 class of model: ets, arima and tbats.
For each, goes with a short explaination about the method, a section of coding and some additional remarks about the output. Again, I learned these from Forecasting:Principles and Practice, if you have time, read it.
# ?? forecast::ets()
ets1_fc <- train_ts %>%
ets(.,damped =T, allow.multiplicative.trend = T ) %>%
forecast( ., h = length(test_ts) )
## Warning in ets(., damped = T, allow.multiplicative.trend = T): I can't handle
## data with frequency greater than 24. Seasonality will be ignored. Try stlf() if
## you need seasonal forecasts.
checkresiduals( ets1_fc) # about
##
## Ljung-Box test
##
## data: Residuals from ETS(A,Ad,N)
## Q* = 256.06, df = 142, p-value = 1.464e-08
##
## Model df: 5. Total lags used: 147
test_forecast(forecast.obj = ets1_fc ,
actual = rbind( train_xts, test_xts),
test = test_xts)
accuracy(ets1_fc, test_ts )
## ME RMSE MAE MPE MAPE MASE
## Training set -2752.191 1647855 1071050.8 -10.25152 27.31593 0.7107637
## Test set -377599.900 1439462 989069.2 -20.40694 28.49301 0.6563596
## ACF1 Theil's U
## Training set 0.1154781 NA
## Test set 0.2362324 1.007458
# box-cox transformation
ets2_fc <- train_ts %>%
ets(.,damped =T, allow.multiplicative.trend = T, lambda = 0 ) %>%
forecast( ., h = length(test_ts) )
## Warning in ets(., damped = T, allow.multiplicative.trend = T, lambda = 0): I
## can't handle data with frequency greater than 24. Seasonality will be ignored.
## Try stlf() if you need seasonal forecasts.
accuracy(ets2_fc, test_ts )
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 140657.5 1634750 1006885 -5.58690 24.57455 0.6681827 0.08292693
## Test set -437356.1 1456232 1029709 -22.23896 29.96143 0.6833290 0.23616507
## Theil's U
## Training set NA
## Test set 1.02463
## auto.arima application
arima1_fc <- train_ts %>%
auto.arima( . , max.p = 3, max.q = 2, max.d = 1) %>%
forecast( . , h = length(test_ts))
# autoplot(arima1_fc )
test_forecast(forecast.obj = arima1_fc ,
actual = rbind( train_xts, test_xts),
test = test_xts)
checkresiduals( arima1_fc)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)
## Q* = 296.5, df = 145, p-value = 1.843e-12
##
## Model df: 2. Total lags used: 147
accuracy(arima1_fc, test_ts )
## ME RMSE MAE MPE MAPE MASE
## Training set -105875.0 1558318 1068148.2 -16.231779 30.54475 0.7088375
## Test set 200350.5 1403456 748522.1 -2.684243 18.49179 0.4967294
## ACF1 Theil's U
## Training set 0.05782062 NA
## Test set 0.23626598 0.9331335
## auto.arima with xreg and log transformation
arima2_fc <- train_ts %>%
auto.arima(., max.p = 3, max.q = 2, max.d = 1, xreg = train_holiday, lambda = 0) %>%
forecast(., xreg = test_holiday, h = length(test_ts))
accuracy(arima2_fc, test_ts )
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 154797.2 1570016 992729.5 -7.999977 26.14839 0.6587887 0.07341816
## Test set 501398.0 1479477 779229.0 6.535402 17.75847 0.5171068 0.23841939
## Theil's U
## Training set NA
## Test set 0.9683756
test_forecast(forecast.obj = arima2_fc ,
actual = rbind( train_xts, test_xts),
test = test_xts)
## Regression with ARIMA errors
# determin the best K for fourier
Theil_u <- vector()
for ( i in seq(6)) {
z <- fourier( train_ts, K=i)
zf <- fourier( test_ts, K=i,
h=length(test_ts) )
regarima <- auto.arima( train_ts , seasonal =F,
xreg=cbind(z,hol=train_holiday) )
regarima_fc <- forecast(regarima,
xreg=cbind(zf,hol=test_holiday),
h= length(test_ts) )
Theil_u[i] <- accuracy(regarima_fc, test_ts )[2,8]
}
bestK <- which.min(Theil_u)
rm(z, zf, regarima, regarima_fc,i,Theil_u )
# or by AIC
# bestfit <- list(aicc=Inf)
# for(K in seq(25)) {
# fit <- auto.arima(train_ts,
# xreg=cbind(z,hol=train_holiday),
# seasonal=FALSE)
# if(fit[["aicc"]] < bestfit[["aicc"]]) {
# bestfit <- fit
# bestK <- K
# }
# }
# bestK
z <- fourier( train_ts, K= bestK )
zf <- fourier( test_ts, K= bestK , h=length(test_ts) )
arima3_fc <- train_ts %>%
auto.arima( . , xreg=cbind(z,hol=train_holiday), seasonal=F, lambda = 0) %>%
forecast(., xreg=cbind(zf,hol=test_holiday), h= length(test_ts) )
accuracy(arima3_fc, test_ts )
## ME RMSE MAE MPE MAPE MASE
## Training set 136455.5 1634804 1005262.3 -5.679592 24.56103 0.6671056
## Test set -103138.5 1389068 806554.5 -11.962152 21.94878 0.5352404
## ACF1 Theil's U
## Training set 0.08498273 NA
## Test set 0.23645116 0.9489268
checkresiduals( arima3_fc)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(0,1,1) errors
## Q* = 274.4, df = 136, p-value = 2.194e-11
##
## Model df: 11. Total lags used: 147
test_forecast(forecast.obj = arima3_fc ,
actual = rbind( train_xts, test_xts),
test = test_xts)
## simple
bat1 <- tbats( train_msts)
# print(bat1)
bat1_fc <- forecast(bat1, h = length(test_ts) )
accuracy(bat1_fc, test_ts )
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 115591.3 1461810 876010.0 -6.714077 22.44940 0.5597782 -0.0124778
## Test set 229561.5 1547009 904024.2 -2.691073 23.43636 0.5776795 0.2485225
## Theil's U
## Training set NA
## Test set 1.038621
## with cox-box transformation and xreg
bat2_fc <- train_msts %>%
tbats(., lambda=0, xreg = train_holiday ) %>%
forecast( ., h = length(test_ts) )
accuracy(bat2_fc, test_ts )
## ME RMSE MAE MPE MAPE MASE
## Training set 114892.0995 1466989 878695.4 -6.715136 22.47679 0.5614942
## Test set 508.6969 1498184 906200.9 -9.716975 24.72468 0.5790704
## ACF1 Theil's U
## Training set -0.0334446 NA
## Test set 0.2155837 1.018188
checkresiduals(bat2_fc)
##
## Ljung-Box test
##
## data: Residuals from TBATS(0.064, {1,0}, -, {<7,3>, <30.5,6>, <365.25,3>})
## Q* = 180.44, df = 112, p-value = 4.44e-05
##
## Model df: 35. Total lags used: 147
test_forecast(forecast.obj = bat2_fc ,
actual = rbind( train_xts, test_xts),
test = test_xts)
Yes, we have used the accuracy on all the models. But are the sample representing well enough? Since we have long-term data, it’s better to test through all history and get an average performance of each model, or so called backtesting.
Here is an example, with ugly coding, though.
inds <- seq( min(idata$date), max(idata$date), by = "day")
tsdata <- ts( idata$ivalue,
start = c( year(inds[1]), as.numeric(format(inds[1], "%j"))),
frequency = 365.25 )
ts_info(tsdata)
## The tsdata series is a ts object with 1 variable and 1257 observations
## Frequency: 365.25
## Start time: 2016.66803559206
## End time: 2020.1067761807
mstsdata <- msts(tsdata, seasonal.periods=c(7, 365.25)) #
tr = 100 # length of train set
te = 20 # length of test and predict set
step = 30 # the window between two train sets
k = bestK # best K for fourier transformation
ts_model_backtesting <- function( tr, te, step, k = bestK ) {
icyc <- floor(( length(tsdata)- tr-te)/step)
perf <- data.frame()
for ( i in 1: icyc ) {
# data preprocessing
tstrain <- subset( tsdata, start = step*i, end = (step*i+tr-1) )
tstest <- subset( tsdata, start = (step*i+tr) , end = (step*i+tr+te-1) )
tagtran <- idata$holiday [ (step*i) : (step*i+tr-1)]
tagtest <- idata$holiday [ (step*i+tr): (step*i+tr+te-1)]
z <- fourier( tstrain, K= bestK )
zf <- fourier( tstest, K= bestK )
# ets
ssets <- ets( tstrain
, damped =T
, allow.multiplicative.trend = T
, lambda = 0
)
fets <- forecast( ssets, h= te )%>%
accuracy(., tagtest) %>% .[2,2]
# arima
sarima <- try( auto.arima(
tstrain , lambda = 0, seasonal=TRUE ,xreg=cbind(z,hol=tagtran)))
farima <- if( class(sarima)[1] !='try-error' ) {
forecast(sarima,xreg=cbind(zf,hol=tagtest), h= te,biasadj =F ) %>%
accuracy(., tagtest ) %>% .[2,2]
} else {NA}
#tbats
sbats <- try( tbats(
subset(mstsdata, start = (step*i), end = (step*i+tr-1) ),
xreg = tagtran, lambda =0 ) )
fbats <- if( class(sbats)[1] !='try-error' ) {
forecast(sbats, xreg= tagtest, h= te ) %>%
accuracy(., tagtest ) %>% .[2,2]
} else {NA}
a <- data.frame(
round = i ,
rmse_ets = fets,
rmse_arima = farima,
rmse_bats = fbats )
perf <- rbind(a, perf )
}
return( perf)
}
perf_quarter <- ts_model_backtesting( tr =100, te =20, step =30 )
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) :
## No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) :
## No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) :
## No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) :
## No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) :
## No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) :
## No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) :
## No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) :
## No suitable ARIMA model found
## Error in auto.arima(as.numeric(first.model$errors), d = 0, ...) :
## No suitable ARIMA model found
summary(perf_quarter)
## round rmse_ets rmse_arima rmse_bats
## Min. : 1 Min. :1780194 Min. :2618949 Min. :1940536
## 1st Qu.:10 1st Qu.:2761889 1st Qu.:3637561 1st Qu.:3214171
## Median :19 Median :3821193 Median :4237082 Median :3749071
## Mean :19 Mean :4336923 Mean :5077623 Mean :4343436
## 3rd Qu.:28 3rd Qu.:5805635 3rd Qu.:6639816 3rd Qu.:5266380
## Max. :37 Max. :9436381 Max. :8993253 Max. :9268679
## NA's :9
perf_annuel <- ts_model_backtesting( tr =380, te =20, step =30 )
summary(perf_annuel)
## round rmse_ets rmse_arima rmse_bats
## Min. : 1.00 Min. :1947776 Min. :2833904 Min. :2552970
## 1st Qu.: 7.75 1st Qu.:3428083 1st Qu.:3564460 1st Qu.:3559865
## Median :14.50 Median :3827823 Median :3907917 Median :3682509
## Mean :14.50 Mean :3954214 Mean :4178775 Mean :3804682
## 3rd Qu.:21.25 3rd Qu.:4415535 3rd Qu.:4718107 3rd Qu.:3942935
## Max. :28.00 Max. :7137421 Max. :6744029 Max. :5808713