Time series forecast with stochastic gradient boosting

Time series analysis are aimed to extract meaningful statistics out of the time series data and to make forecast of the future trends based on previous and current observations. Prominent characteristics of time series is that they have its own internal structure like auto-correlation, trend or seasonal variation that should be accounted for. For example, sales of a company can rapidly grow over years but they still follow consistent seasonal patterns (with the picks in December and low seasonality in August). This document analyses daily sales data for a period of two and a half years. The goal is to identify patterns and to make forecast for the last two weeks (15 days). Predictive algorithm used for forecasting is stochastic gradient boosting. Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of the ensemble of weak prediction models, typically decision trees. However this ensemble of weak predictors when combined together gives a strong model with superior accuracy. Stochastic gradient boosting is very popular method for making predictive models, but in time series analysis its use is still pioneering.

Exploratory analysis

In order to get insight into the nature of the data and to identify hidden patterns exploratory data analysis are performed. First, we read the .csv file and visualize the original time series to see their behavior.

library(ggplot2)
library(TTR)
library(gbm)

train1 <- read.csv("training1_new.csv", header = TRUE, sep = ",")
head(train1)

##       TSDate sales
## 1 2013-06-21  1248
## 2 2013-06-22     0
## 3 2013-06-23     0
## 4 2013-06-24  3393
## 5 2013-06-25  1767
## 6 2013-06-26  1663

dim(train1)

## [1] 878   2

The graph shows seasonal pattern with high seasonality round Christmas, upward trend and daily and weekly frequency. To make model that fits to the real nature of the data we introduce a few dummy variables, like days in the week, weekdays, weekends, month, year, number of the week seasonality with high season before Christmas and see what is the relation between the sales and given period of the year, month, week…

train1$days <- weekdays(as.Date(train1$TSDate))
train1$weeks <- ifelse(train1$days == "Saturday" | train1$days == "Sunday", "Weekend", "Weekday")
train1$month <- as.numeric(format(as.Date(train1$TSDate), "%m"))
train1$year <- as.numeric(format(as.Date(train1$TSDate), "%Y"))
train1$weeknum <- as.POSIXlt(train1$TSDate)
train1$weeknum <- strftime(train1$weeknum,format="%W")

train1$seasons <- ifelse(as.Date(train1$TSDate) %in% seq(as.Date('2013-12-01'),as.Date('2013-12-23'),by = 1) | as.Date(train1$TSDate) %in% seq(as.Date('2014-12-01'),as.Date('2014-12-23'),by = 1) | as.Date(train1$TSDate) %in% seq(as.Date('2014-06-23'),as.Date('2014-07-07'),by = 1) | as.Date(train1$TSDate) %in% seq(as.Date('2015-06-23'),as.Date('2015-07-07'),by = 1), "high season", "low season")

head(train1)

##       TSDate sales      days   weeks month year weeknum    seasons
## 1 2013-06-21  1248    Friday Weekday     6 2013      24 low season
## 2 2013-06-22     0  Saturday Weekend     6 2013      24 low season
## 3 2013-06-23     0    Sunday Weekend     6 2013      24 low season
## 4 2013-06-24  3393    Monday Weekday     6 2013      25 low season
## 5 2013-06-25  1767   Tuesday Weekday     6 2013      25 low season
## 6 2013-06-26  1663 Wednesday Weekday     6 2013      25 low season

To discover the hidden pattern and see the daily and weekly fluctuation of the sales we consider each day of the week as a separate factor and plot sales separately for each day of the week respective to the number of the week for the entire period of of two and a half years.

From the graph we can see daily and weekly frequency with highest sales in Monday and lowest in Saturday. As it was mentioned before the future values depends of the past values, what means that today sales depend of the sales in the previous period, and to get proper forecast this relation should be accounted for. For this purpose we introduced a new variable the moving averages of the past 15 days. Moving averages or in our case simply the average price from the last 15 days, is a widely used indicator in technical analysis that helps smooth out the forecast by filtering noise from random value fluctuations. A moving average is a trend following lagging indicator based on past values. We used exponential moving averages, which gives bigger weight to more recent observations.

Forecast

Now when the behavior is known we can employ our predictive algorithm to make forecast for the next 15 days. Thus the test set consists of only 15 days and the rest of the data belongs to training set. The actual sales from the test set are putted away and used as control/reference parameter to compare the actual and predicted data and see the accuracy of the prediction.

set.seed(150)
train1$ma <- EMA(train1$sales, 15)
train1 <- train1[-(1:15),]

train1$days <- as.factor(train1$days)
train1$weeks <- as.factor(train1$weeks)
train1$weeknum <- as.factor(train1$weeknum)
train1$seasons <- as.factor(train1$seasons)
train1$month <- as.factor(train1$month)
train1$year <- as.factor(train1$year)

train <- train1[(1:848),]
val <- train1[(849:863),]
sales <- val$sales
val$sales <- NULL
Dates <- as.Date(as.character(val$TSDate))



formula <- sales~(days + weeks + seasons + weeknum + month + year)*ma

fit <- gbm(formula, data = train, n.trees = 100000)

## Distribution not specified, assuming gaussian ...

model <- predict(fit, newdata = val, n.trees = 100000)

Here is the table with actual sales (denoted as sales), predicted values (denoted as model), for 15 days period of time

df <- data.frame(Dates, sales, (floor(model)))
colnames(df)[3] <- "model"
df

##         Dates sales model
## 1  2015-11-01  4068  4331
## 2  2015-11-02  5787  5489
## 3  2015-11-03  5269  4985
## 4  2015-11-04  5145  5114
## 5  2015-11-05  5730  5300
## 6  2015-11-06  4843  4866
## 7  2015-11-07  3386  3310
## 8  2015-11-08  4681  4573
## 9  2015-11-09  5928  5947
## 10 2015-11-10  4700  5298
## 11 2015-11-11  5111  5301
## 12 2015-11-12  5669  5254
## 13 2015-11-13  5054  4820
## 14 2015-11-14  3280  3479
## 15 2015-11-15  4891  4855

and the correlation between the actual values and the model

We can also plot predicted and actual values for the examined period of time.

and calculate the RMSE (root mean square error)

err <- floor(sqrt(mean((sales - model)^2)))
err

## [1] 271