Time series analysis are aimed to extract meaningful statistics out of the time series data and to make forecast of the future trends based on previous and current observations. Prominent characteristics of time series is that they have its own internal structure like auto-correlation, trend or seasonal variation that should be accounted for. For example, sales of a company can rapidly grow over years but they still follow consistent seasonal patterns (with the picks in December and low seasonality in August). This document analyses daily sales data for a period of two and a half years. The goal is to identify patterns and to make forecast for the last two weeks (15 days). Predictive algorithm used for forecasting is stochastic gradient boosting. Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of the ensemble of weak prediction models, typically decision trees. However this ensemble of weak predictors when combined together gives a strong model with superior accuracy. Stochastic gradient boosting is very popular method for making predictive models, but in time series analysis its use is still pioneering.
In order to get insight into the nature of the data and to identify hidden patterns exploratory data analysis are performed. First, we read the .csv file and visualize the original time series to see their behavior.
library(ggplot2)
library(TTR)
library(gbm)
train1 <- read.csv("training1_new.csv", header = TRUE, sep = ",")
head(train1)
## TSDate sales
## 1 2013-06-21 1248
## 2 2013-06-22 0
## 3 2013-06-23 0
## 4 2013-06-24 3393
## 5 2013-06-25 1767
## 6 2013-06-26 1663
dim(train1)
## [1] 878 2
The graph shows seasonal pattern with high seasonality round Christmas, upward trend and daily and weekly frequency. To make model that fits to the real nature of the data we introduce a few dummy variables, like days in the week, weekdays, weekends, month, year, number of the week seasonality with high season before Christmas and see what is the relation between the sales and given period of the year, month, week…
train1$days <- weekdays(as.Date(train1$TSDate))
train1$weeks <- ifelse(train1$days == "Saturday" | train1$days == "Sunday", "Weekend", "Weekday")
train1$month <- as.numeric(format(as.Date(train1$TSDate), "%m"))
train1$year <- as.numeric(format(as.Date(train1$TSDate), "%Y"))
train1$weeknum <- as.POSIXlt(train1$TSDate)
train1$weeknum <- strftime(train1$weeknum,format="%W")
train1$seasons <- ifelse(as.Date(train1$TSDate) %in% seq(as.Date('2013-12-01'),as.Date('2013-12-23'),by = 1) | as.Date(train1$TSDate) %in% seq(as.Date('2014-12-01'),as.Date('2014-12-23'),by = 1) | as.Date(train1$TSDate) %in% seq(as.Date('2014-06-23'),as.Date('2014-07-07'),by = 1) | as.Date(train1$TSDate) %in% seq(as.Date('2015-06-23'),as.Date('2015-07-07'),by = 1), "high season", "low season")
head(train1)
## TSDate sales days weeks month year weeknum seasons
## 1 2013-06-21 1248 Friday Weekday 6 2013 24 low season
## 2 2013-06-22 0 Saturday Weekend 6 2013 24 low season
## 3 2013-06-23 0 Sunday Weekend 6 2013 24 low season
## 4 2013-06-24 3393 Monday Weekday 6 2013 25 low season
## 5 2013-06-25 1767 Tuesday Weekday 6 2013 25 low season
## 6 2013-06-26 1663 Wednesday Weekday 6 2013 25 low season
To discover the hidden pattern and see the daily and weekly fluctuation of the sales we consider each day of the week as a separate factor and plot sales separately for each day of the week respective to the number of the week for the entire period of of two and a half years.
From the graph we can see daily and weekly frequency with highest sales in Monday and lowest in Saturday. As it was mentioned before the future values depends of the past values, what means that today sales depend of the sales in the previous period, and to get proper forecast this relation should be accounted for. For this purpose we introduced a new variable the moving averages of the past 15 days. Moving averages or in our case simply the average price from the last 15 days, is a widely used indicator in technical analysis that helps smooth out the forecast by filtering noise from random value fluctuations. A moving average is a trend following lagging indicator based on past values. We used exponential moving averages, which gives bigger weight to more recent observations.
Now when the behavior is known we can employ our predictive algorithm to make forecast for the next 15 days. Thus the test set consists of only 15 days and the rest of the data belongs to training set. The actual sales from the test set are putted away and used as control/reference parameter to compare the actual and predicted data and see the accuracy of the prediction.
set.seed(150)
train1$ma <- EMA(train1$sales, 15)
train1 <- train1[-(1:15),]
train1$days <- as.factor(train1$days)
train1$weeks <- as.factor(train1$weeks)
train1$weeknum <- as.factor(train1$weeknum)
train1$seasons <- as.factor(train1$seasons)
train1$month <- as.factor(train1$month)
train1$year <- as.factor(train1$year)
train <- train1[(1:848),]
val <- train1[(849:863),]
sales <- val$sales
val$sales <- NULL
Dates <- as.Date(as.character(val$TSDate))
formula <- sales~(days + weeks + seasons + weeknum + month + year)*ma
fit <- gbm(formula, data = train, n.trees = 100000)
## Distribution not specified, assuming gaussian ...
model <- predict(fit, newdata = val, n.trees = 100000)
Here is the table with actual sales (denoted as sales), predicted values (denoted as model), for 15 days period of time
df <- data.frame(Dates, sales, (floor(model)))
colnames(df)[3] <- "model"
df
## Dates sales model
## 1 2015-11-01 4068 4331
## 2 2015-11-02 5787 5489
## 3 2015-11-03 5269 4985
## 4 2015-11-04 5145 5114
## 5 2015-11-05 5730 5300
## 6 2015-11-06 4843 4866
## 7 2015-11-07 3386 3310
## 8 2015-11-08 4681 4573
## 9 2015-11-09 5928 5947
## 10 2015-11-10 4700 5298
## 11 2015-11-11 5111 5301
## 12 2015-11-12 5669 5254
## 13 2015-11-13 5054 4820
## 14 2015-11-14 3280 3479
## 15 2015-11-15 4891 4855
and the correlation between the actual values and the model
We can also plot predicted and actual values for the examined period of time.
and calculate the RMSE (root mean square error)
err <- floor(sqrt(mean((sales - model)^2)))
err
## [1] 271