List of days with missing hours:
[1] "2013-09-04" "2013-09-07" "2013-09-09" "2013-09-16" "2013-09-30"
[6] "2013-10-01" "2013-10-02" "2013-10-04" "2013-10-10" "2013-10-21"
[11] "2013-10-22" "2013-10-31" "2013-11-05" "2013-11-11" "2013-11-12"
[16] "2013-11-15" "2013-12-05" "2013-12-25" "2014-01-22" "2014-02-04"
[21] "2014-02-11" "2014-02-18" "2014-02-19" "2014-02-20" "2014-03-12"
[26] "2014-03-25" "2014-04-01" "2014-04-07" "2014-04-21" "2014-04-25"
[31] "2014-04-28" "2014-05-01" "2014-05-07" "2014-05-09" "2014-05-12"
[36] "2014-05-13" "2014-05-15" "2014-05-29" "2014-06-01" "2014-06-05"
[41] "2014-08-19" "2014-09-11" "2014-09-15" "2014-09-25" "2014-10-01"
[46] "2014-11-03" "2014-12-25" "2015-02-02" "2015-12-25"
After that, the data is ready for analysis and can be viewed.
First, let's plot the data:
As the plot shows, the data is periodic, with a shortest period of 1 day. The number of orders also drops around Christmas. Every day has two peaks: the first at dinner time, the second in the evening.
General statistics of the number of orders:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   26.00   53.00   77.32   98.00  440.00
So our data is strongly right-skewed: the mean (77.32) is well above the median (53.00). Most often there are 20 to 30 orders per hour. The median is 53 orders per hour, and the mean is about 77.
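The skew described above can be checked by comparing the mean with the median. The following is a minimal sketch, not the author's code: `order_count` is a hypothetical simulated vector standing in for the hourly counts, which are not available here.

```r
# Hypothetical stand-in for the hourly order counts (the real data is not shown).
set.seed(1)
order_count <- rnbinom(5000, size = 1.2, mu = 77)

# Same six-number summary as printed in the report.
stats <- summary(order_count)
print(stats)

# In a right-skewed sample a few very busy hours pull the mean
# above the median.
is_right_skewed <- mean(order_count) > median(order_count)
print(is_right_skewed)
```

With real data, the same two lines (`summary()` and the mean/median comparison) apply directly to the count column.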
Trend in data:
There is a positive trend in the number of orders per month. On average, it increased by 20k over two years.
Difference between days of the week:
The plot shows no visually obvious difference in the number of orders between days of the week.
One-way analysis of means (not assuming equal variances)
data: days_orders$order_count and days_orders$order_date
F = 2.996, num df = 6.00, denom df = 355.34, p-value = 0.007215
Pairwise comparisons using t tests with pooled SD
data: days_orders$order_count and days_orders$order_date
          Sunday Saturday Friday Thursday Wednesday Tuesday
Saturday  1.0000 -        -      -        -         -
Friday    1.0000 1.0000   -      -        -         -
Thursday  0.3768 0.6525   0.2719 -        -         -
Wednesday 1.0000 1.0000   1.0000 1.0000   -         -
Tuesday   0.6476 0.7215   1.0000 0.0015   0.1547    -
Monday    1.0000 1.0000   1.0000 0.9489   1.0000    0.7215
P value adjustment method: holm
The one-way analysis of means gives p = 0.007215 < 0.05, so at the 5% significance level we reject the null hypothesis H0 that the mean number of orders is equal across all days of the week: at least one pair of days differs. In the pairwise t tests with Holm adjustment, the only pair with an adjusted p-value below 0.05 is Tuesday versus Thursday (p = 0.0015), so we conclude that Tuesday has a smaller average number of orders than Thursday.
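The two tests reported above correspond to `oneway.test()` and `pairwise.t.test()` from base R's stats package. The sketch below shows how they could be called. It assumes a hypothetical data frame with a `weekday` factor and simulated counts, in which Tuesday is deliberately shifted down. The column names and the data are illustrative, not the author's.

```r
# Hypothetical daily totals: 52 weeks, with Tuesday shifted down on purpose.
set.seed(42)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
days_orders <- data.frame(
  weekday = factor(rep(day_names, times = 52), levels = day_names)
)
days_orders$order_count <- rnorm(nrow(days_orders), mean = 1800, sd = 300) -
  ifelse(days_orders$weekday == "Tuesday", 400, 0)

# Welch one-way test: compares group means without assuming equal variances.
welch <- oneway.test(order_count ~ weekday, data = days_orders,
                     var.equal = FALSE)
print(welch)

# Pairwise t tests with pooled SD; the Holm adjustment controls the
# family-wise error rate across all 21 pairs of days.
pairs <- pairwise.t.test(days_orders$order_count, days_orders$weekday,
                         p.adjust.method = "holm")
print(pairs)
```

The omnibus test only says that some difference exists. The adjusted pairwise table is what identifies which pair of days drives it.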
Difference in number of orders per day for different months:
July and September have fewer orders per day than January and February (this could also be verified with a statistical test). A possible reason is that July and September are vacation months.
Difference in number of orders per hour for different hours of day:
Some interesting patterns appear here. Hours 0-10 have a very small number of orders per hour, whereas hours 17-20 are very busy.
Difference between working days and weekend days:
For almost the whole range of hours, the average number of orders per hour is higher on weekends.
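The comparison of hourly averages between working days and weekends can be sketched with `tapply()`. The timestamps and counts below are simulated and hypothetical. Only the grouping logic is the point.

```r
# Hypothetical hourly series covering four full weeks (not the author's data).
set.seed(7)
timestamps <- seq(as.POSIXct("2014-01-05 00:00", tz = "UTC"),
                  by = "hour", length.out = 24 * 28)
lt <- as.POSIXlt(timestamps)
hour_of_day <- lt$hour
# wday is 0 for Sunday and 6 for Saturday, independent of locale.
day_type <- ifelse(lt$wday %in% c(0, 6), "weekend", "working")

# Simulated counts: an evening peak, scaled up on weekends.
base_rate <- 20 + 80 * exp(-((hour_of_day - 18)^2) / 8)
order_count <- rpois(length(timestamps),
                     base_rate * ifelse(day_type == "weekend", 1.3, 1))

# Mean orders per hour of day, split by day type (24 x 2 matrix).
hourly_means <- tapply(order_count, list(hour_of_day, day_type), mean)
print(round(hourly_means, 1))
```

Each row of `hourly_means` is an hour of the day, and each column is a day type, so the two columns can be plotted as two lines over the 24 hours.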
Outliers for each day of the week can be visualized with a boxplot:
All outliers are shown as dots in the plot above. For example, one outlier for Friday corresponds to 2014-07-04, Independence Day in the USA.
List of all found outliers:
[1] "2015-02-02" "2015-03-23" "2015-05-25" "2015-06-15" "2015-09-07"
[6] "2015-12-28" "2013-12-24" "2013-12-25" "2014-01-01" "2014-12-24"
[11] "2015-12-30" "2013-11-28" "2014-11-27" "2015-01-01" "2015-02-19"
[16] "2015-11-26" "2014-07-04" "2015-02-14" "2015-10-31" "2015-11-21"
[21] "2014-05-25" "2015-02-15" "2015-04-19" "2015-11-01"
There are 24 outlier days.
9 of the 24 outliers (37.5%) in our time series fall on bank holidays.
Many of the remaining outliers correspond to national holidays in the USA:
[1] "2015-02-02" "2015-03-23" "2015-06-15" "2015-12-28" "2013-12-24"
[6] "2014-12-24" "2015-12-30" "2015-02-19" "2015-02-14" "2015-10-31"
[11] "2015-11-21" "2014-05-25" "2015-02-15" "2015-04-19" "2015-11-01"
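The dots drawn by a boxplot can also be extracted as values with `boxplot.stats()`, which applies the same 1.5 x IQR whisker rule that `boxplot()` uses by default. The sketch below assumes a hypothetical data frame with two weekdays and one injected extreme day. It is an illustration of the rule, not the author's procedure.

```r
# Hypothetical daily totals for two weekdays (not the author's data).
set.seed(3)
days_orders <- data.frame(
  weekday = rep(c("Friday", "Monday"), each = 100),
  order_count = rnorm(200, mean = 1800, sd = 150)
)
# Inject one unusually quiet Friday, e.g. a public holiday.
days_orders$order_count[1] <- 600

# Values beyond the boxplot whiskers, computed separately for each weekday.
outliers_by_day <- lapply(
  split(days_orders$order_count, days_orders$weekday),
  function(x) boxplot.stats(x)$out
)
print(outliers_by_day)
```

To recover the dates rather than the values, the same rule can be applied to a data frame that keeps a date column, filtering on the flagged values.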
KPSS Test for Level Stationarity
data: orders_ts_train
KPSS Level = 12.737, Truncation lag parameter = 33, p-value = 0.01
The reported p-value is 0.01, the smallest value this test prints, so we reject the null hypothesis of level stationarity at the 1% significance level and conclude that the time series is non-stationary. After first-order differencing:
KPSS Test for Level Stationarity
data: diff(orders_ts_train)
KPSS Level = 0.0006921, Truncation lag parameter = 33, p-value = 0.1
The reported p-value is 0.1, the largest value this test prints, so we cannot reject the null hypothesis that the differenced time series is stationary.
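The differencing step and the two KPSS tests can be sketched as follows. The series here is a hypothetical random walk, not the author's `orders_ts_train`. `kpss.test()` comes from the tseries package, which is assumed to be installed, so the calls are guarded.

```r
# Hypothetical non-stationary series (a random walk), not the author's data.
set.seed(11)
x <- ts(cumsum(rnorm(500)))

# First-order differencing: d_t = x_t - x_{t-1}; one observation is lost.
dx <- diff(x, differences = 1)

# The KPSS null hypothesis is level stationarity, so a small p-value argues
# against stationarity. The test only reports p-values between 0.01 and 0.1
# and warns when the true value lies outside that range; the warning is
# suppressed here.
if (requireNamespace("tseries", quietly = TRUE)) {
  print(suppressWarnings(tseries::kpss.test(x, null = "Level")))
  print(suppressWarnings(tseries::kpss.test(dx, null = "Level")))
}
```

Note the direction of the hypotheses: unlike the ADF test, KPSS treats stationarity as the null, so failing to reject is the outcome that supports a stationary series.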
Call:
arima(x = orders_ts_train, order = c(2, 1, 1), seasonal = list(order = c(2,
1, 0), period = 24))
Coefficients:
         ar1      ar2      ma1     sar1     sar2
      1.1037  -0.3415  -0.9490  -0.4975  -0.2542
s.e.  0.0071   0.0068   0.0037   0.0069   0.0068
sigma^2 estimated as 140.8: log likelihood = -79688.93, aic = 159389.9
RMSE on training data:
[1] 11.85812
RMSE on testing data:
[1] 41.77625
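The two RMSE values could be computed along the following lines. This is a sketch under stated assumptions: the series is simulated, the hold-out is one day, and the model is deliberately simpler than the one fitted above (AR(1) with seasonal differencing) so that it fits quickly. The `rmse` helper is a hypothetical name.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical hourly series with a 24-hour cycle (not the author's data).
set.seed(5)
n <- 24 * 30
t_idx <- seq_len(n)
orders <- ts(50 + 30 * sin(2 * pi * t_idx / 24) + rnorm(n, sd = 5),
             frequency = 24)

# Hold out the final day for testing.
train <- ts(orders[1:(n - 24)], frequency = 24)
test <- orders[(n - 23):n]

# Simpler stand-in model: AR(1) plus one seasonal difference at lag 24.
fit <- arima(train, order = c(1, 0, 0),
             seasonal = list(order = c(0, 1, 0), period = 24))

# In-sample residuals are one-step-ahead errors.
rmse_train <- sqrt(mean(fit$residuals^2))
# Hold-out predictions are multi-step forecasts, 1 to 24 hours ahead.
forecast <- predict(fit, n.ahead = 24)$pred
rmse_test <- rmse(test, as.numeric(forecast))
print(c(train = rmse_train, test = rmse_test))
```

Because the training error is measured one step ahead and the test error over a multi-step horizon, some gap between the two numbers is expected in a setup like this.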
Plot of the prediction for the last week: