Data preprocessing:

  1. Inserting rows for missing hours. The orders dataset contains 49 days in which data for one or more hours is missing. These hours were treated as non-working hours, and rows were inserted for them with the number of orders set to 0.

List of days with missing hours:

 [1] "2013-09-04" "2013-09-07" "2013-09-09" "2013-09-16" "2013-09-30"
 [6] "2013-10-01" "2013-10-02" "2013-10-04" "2013-10-10" "2013-10-21"
[11] "2013-10-22" "2013-10-31" "2013-11-05" "2013-11-11" "2013-11-12"
[16] "2013-11-15" "2013-12-05" "2013-12-25" "2014-01-22" "2014-02-04"
[21] "2014-02-11" "2014-02-18" "2014-02-19" "2014-02-20" "2014-03-12"
[26] "2014-03-25" "2014-04-01" "2014-04-07" "2014-04-21" "2014-04-25"
[31] "2014-04-28" "2014-05-01" "2014-05-07" "2014-05-09" "2014-05-12"
[36] "2014-05-13" "2014-05-15" "2014-05-29" "2014-06-01" "2014-06-05"
[41] "2014-08-19" "2014-09-11" "2014-09-15" "2014-09-25" "2014-10-01"
[46] "2014-11-03" "2014-12-25" "2015-02-02" "2015-12-25"
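
The zero-filling step described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the report; the function name `fill_missing_hours` and the sample counts are hypothetical.

```python
from datetime import datetime, timedelta


def fill_missing_hours(hourly_counts):
    """Return a dict covering every hour between the first and last
    timestamp, using 0 for hours that are absent from the input."""
    start, end = min(hourly_counts), max(hourly_counts)
    filled = {}
    current = start
    while current <= end:
        # Hours without a row are treated as non-working hours (0 orders).
        filled[current] = hourly_counts.get(current, 0)
        current += timedelta(hours=1)
    return filled


# Hypothetical sample: the row for 02:00 is missing.
sample = {
    datetime(2013, 9, 4, 0): 5,
    datetime(2013, 9, 4, 1): 3,
    datetime(2013, 9, 4, 3): 7,
}
filled = fill_missing_hours(sample)

# Days that needed at least one inserted row.
days_with_gaps = sorted({ts.date().isoformat() for ts in filled if ts not in sample})
print(days_with_gaps)
```

Collecting the dates of the inserted rows, as in the last step, is one way to produce a list like the one above.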

After that, the data is ready for analysis and can be viewed:


Statistical exploration of the dataset and identifying patterns in data

  1. First, let’s plot the data:

    As the plot shows, the data is periodic, with the shortest period being 1 day. The number of orders also drops around Christmas. Every day has two peaks: the first at dinner time, the second in the evening.

  2. General statistics of the number of orders:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   26.00   53.00   77.32   98.00  440.00 

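The data is highly right-skewed: the mean (77.32) is well above the median (53), and the maximum (440) is far from the third quartile (98). The most frequent values are between 20 and 30 orders per hour. The median is 53 orders per hour, and the mean is 77.32.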
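
A summary like the one above can be reproduced with the Python standard library. This is a sketch on hypothetical counts, not the report's data; the `inclusive` quantile method is chosen on the assumption that it matches the default quantile definition behind R's `summary`.

```python
import statistics


def summarize(values):
    """Six-number summary: min, quartiles, mean and max."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


# Hypothetical hourly order counts with one very busy hour.
counts = [0, 10, 20, 30, 40, 50, 200]
summary = summarize(counts)

# A mean above the median is a simple indicator (not a proof) of right skew.
right_skewed = summary["mean"] > summary["median"]
print(summary, right_skewed)
```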

  1. Trend in data:

    There is a positive trend in the number of orders per month. On average, it increased by about 20k over two years.

  2. Difference in days of the week:

    The plot shows no obvious difference in the number of orders between days of the week.
    Let’s verify this with statistical tests: a one-way analysis of means (Welch's ANOVA, which does not assume equal variances) and pairwise t-tests:


    One-way analysis of means (not assuming equal variances)

data:  days_orders$order_count and days_orders$order_date
F = 2.996, num df = 6.00, denom df = 355.34, p-value = 0.007215

    Pairwise comparisons using t tests with pooled SD 

data:  days_orders$order_count and days_orders$order_date 

          Sunday Saturday Friday Thursday Wednesday Tuesday
Saturday  1.0000 -        -      -        -         -      
Friday    1.0000 1.0000   -      -        -         -      
Thursday  0.3768 0.6525   0.2719 -        -         -      
Wednesday 1.0000 1.0000   1.0000 1.0000   -         -      
Tuesday   0.6476 0.7215   1.0000 0.0015   0.1547    -      
Monday    1.0000 1.0000   1.0000 0.9489   1.0000    0.7215 

P value adjustment method: holm 

The one-way analysis of means gives p = 0.007215 < 0.05, so at the 5% significance level we reject the null hypothesis that all days of the week have the same mean number of orders: at least one pair of means differs. In the pairwise t-tests (Holm-adjusted), the only significant pair is Tuesday and Thursday (p = 0.0015): Tuesday has a smaller average number of orders than Thursday.
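
To make the first test concrete, here is a minimal sketch of the Welch F statistic and its degrees of freedom in plain Python. The function name `welch_anova` and the sample groups are hypothetical; the p-value is omitted because the standard library has no F distribution.

```python
import statistics


def welch_anova(groups):
    """Welch's one-way analysis of means (unequal variances).
    Returns (F statistic, numerator df, denominator df)."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [statistics.mean(g) for g in groups]
    variances = [statistics.variance(g) for g in groups]  # sample variances

    # Each group is weighted by the precision of its mean.
    weights = [n / v for n, v in zip(sizes, variances)]
    total_weight = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_weight

    between = sum(w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes))

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df_num = k - 1
    df_den = (k ** 2 - 1) / (3 * lam)
    return f_stat, df_num, df_den


# Hypothetical daily order totals for three days of the week.
groups = [
    [980, 1010, 1005, 990],
    [1200, 1150, 1230, 1180],
    [1000, 1020, 995, 1015],
]
f_stat, df_num, df_den = welch_anova(groups)
print(f_stat, df_num, df_den)
```

With seven groups the numerator degrees of freedom are 7 - 1 = 6, which matches `num df = 6.00` in the output above.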

  1. Difference in number of orders per day for different months:

    July and September have fewer orders per day than January and February (this could also be verified with a statistical test). A possible reason is that July and September are vacation months.

  2. Difference in number of orders per hour for different hours of day:

    An interesting pattern appears here. Hours 0-10 have very few orders per hour, whereas hours 17-20 are the busiest.

  3. Difference between working days and weekends: for almost the whole range of hours, the average number of orders per hour is higher on weekends.
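
The hour-of-day comparison can be sketched as a simple grouping. The function name `mean_orders_by_hour` and the sample counts below are hypothetical, and Saturday and Sunday are assumed to be the weekend days.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_orders_by_hour(hourly_counts):
    """Average orders per hour of day, separately for working days and weekends."""
    buckets = {"working": defaultdict(list), "weekend": defaultdict(list)}
    for ts, count in hourly_counts.items():
        # weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        kind = "weekend" if ts.weekday() >= 5 else "working"
        buckets[kind][ts.hour].append(count)
    return {
        kind: {hour: mean(values) for hour, values in sorted(by_hour.items())}
        for kind, by_hour in buckets.items()
    }


# Hypothetical counts for the 18:00 hour on three days.
sample = {
    datetime(2013, 9, 4, 18): 40,  # Wednesday
    datetime(2013, 9, 5, 18): 60,  # Thursday
    datetime(2013, 9, 7, 18): 90,  # Saturday
}
profile = mean_orders_by_hour(sample)
print(profile)
```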


Outlier detection

  1. Outlier detection is performed separately for each day of the week (since we observed some differences between them), as follows:
  1. Outliers for each day of the week can be visualized with a boxplot:

    All outliers appear as dots in the boxplot. For example, one outlier for Friday corresponds to 2014-07-04, Independence Day in the USA.

  2. List of all outliers found:

 [1] "2015-02-02" "2015-03-23" "2015-05-25" "2015-06-15" "2015-09-07"
 [6] "2015-12-28" "2013-12-24" "2013-12-25" "2014-01-01" "2014-12-24"
[11] "2015-12-30" "2013-11-28" "2014-11-27" "2015-01-01" "2015-02-19"
[16] "2015-11-26" "2014-07-04" "2015-02-14" "2015-10-31" "2015-11-21"
[21] "2014-05-25" "2015-02-15" "2015-04-19" "2015-11-01"

There are 24 outlier days.
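
A boxplot-style rule can be sketched as below. This assumes the usual convention that points more than 1.5 interquartile ranges outside the quartiles are drawn as dots; the quartile definition may differ slightly from the one R's `boxplot` uses, and the daily totals are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import quantiles


def boxplot_outliers(daily_totals, whisker=1.5):
    """Return the days whose total lies outside the whiskers of the
    boxplot for their own day of the week."""
    by_weekday = defaultdict(list)
    for day, total in daily_totals.items():
        by_weekday[day.weekday()].append((day, total))

    outliers = []
    for rows in by_weekday.values():
        totals = [total for _, total in rows]
        if len(totals) < 4:
            continue  # too few observations for meaningful quartiles
        q1, _, q3 = quantiles(totals, n=4, method="inclusive")
        low = q1 - whisker * (q3 - q1)
        high = q3 + whisker * (q3 - q1)
        outliers.extend(day for day, total in rows if total < low or total > high)
    return sorted(outliers)


# Hypothetical totals for six consecutive Fridays.
fridays = {
    date(2014, 6, 6): 1000,
    date(2014, 6, 13): 1020,
    date(2014, 6, 20): 980,
    date(2014, 6, 27): 1010,
    date(2014, 7, 4): 400,
    date(2014, 7, 11): 990,
}
friday_outliers = boxplot_outliers(fridays)
print(friday_outliers)
```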

  1. Share of outliers that fall on bank holidays, based on a bank holiday dataset obtained from here:

37.5% of the outliers in our time series (9 of 24) are bank holidays.

Many of the remaining 15 outliers also correspond to national holidays in the USA:

 [1] "2015-02-02" "2015-03-23" "2015-06-15" "2015-12-28" "2013-12-24"
 [6] "2014-12-24" "2015-12-30" "2015-02-19" "2015-02-14" "2015-10-31"
[11] "2015-11-21" "2014-05-25" "2015-02-15" "2015-04-19" "2015-11-01"
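
The share can be checked from the two lists in this section. The bank holiday dataset itself is not reproduced here, so this sketch assumes that the second list contains exactly the outliers that were not matched to a bank holiday.

```python
# All outlier days found (first list in this section).
outliers = {
    "2015-02-02", "2015-03-23", "2015-05-25", "2015-06-15", "2015-09-07",
    "2015-12-28", "2013-12-24", "2013-12-25", "2014-01-01", "2014-12-24",
    "2015-12-30", "2013-11-28", "2014-11-27", "2015-01-01", "2015-02-19",
    "2015-11-26", "2014-07-04", "2015-02-14", "2015-10-31", "2015-11-21",
    "2014-05-25", "2015-02-15", "2015-04-19", "2015-11-01",
}

# Outliers assumed NOT to be in the bank holiday dataset (second list).
unmatched = {
    "2015-02-02", "2015-03-23", "2015-06-15", "2015-12-28", "2013-12-24",
    "2014-12-24", "2015-12-30", "2015-02-19", "2015-02-14", "2015-10-31",
    "2015-11-21", "2014-05-25", "2015-02-15", "2015-04-19", "2015-11-01",
}

# Under that assumption, the set difference is the bank holiday outliers.
bank_holiday_outliers = outliers - unmatched
share = len(bank_holiday_outliers) / len(outliers)

print(sorted(bank_holiday_outliers))
print(share)
```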

ARIMA Model

  1. First, the order of differencing needed to make the time series stationary has to be found:

    KPSS Test for Level Stationarity

data:  orders_ts_train
KPSS Level = 12.737, Truncation lag parameter = 33, p-value = 0.01

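The reported p-value of 0.01 is the smallest value the test tabulates, so the actual p-value is at most 0.01. We therefore reject the null hypothesis of level stationarity at the 1% significance level and conclude that the time series is non-stationary. After first-order differencing: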


    KPSS Test for Level Stationarity

data:  diff(orders_ts_train)
KPSS Level = 0.0006921, Truncation lag parameter = 33, p-value = 0.1

The p-value is at least 0.1, so we cannot reject the null hypothesis that the differenced time series is stationary.

  1. Fitting the ARIMA model:
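
Differencing itself is a simple operation. The sketch below uses a hypothetical helper `difference` and toy series to show how a lag-1 difference removes a linear trend and how a difference at the seasonal lag removes a repeating pattern; for hourly data with a daily cycle the seasonal lag would be 24.

```python
def difference(series, lag=1):
    """Return the lagged differences y[t] - y[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# A toy series with a linear trend: the first difference is constant.
trend = [10 + 3 * t for t in range(6)]
first_diff = difference(trend)

# A toy series repeating every 2 steps: the lag-2 difference is zero.
seasonal = [5, 9, 5, 9, 5, 9]
seasonal_diff = difference(seasonal, lag=2)

print(first_diff, seasonal_diff)
```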

Call:
arima(x = orders_ts_train, order = c(2, 1, 1), seasonal = list(order = c(2, 
    1, 0), period = 24))

Coefficients:
         ar1      ar2      ma1     sar1     sar2
      1.1037  -0.3415  -0.9490  -0.4975  -0.2542
s.e.  0.0071   0.0068   0.0037   0.0069   0.0068

sigma^2 estimated as 140.8:  log likelihood = -79688.93,  aic = 159389.9
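
For reference, the fitted model can be written out with the backshift operator B, where B y_t = y_{t-1}. This assumes the sign convention documented for R's `arima` (autoregressive terms on the right-hand side, moving-average terms added to the error), with the coefficients taken from the output above:

```latex
(1 - 1.1037\,B + 0.3415\,B^{2})\,(1 + 0.4975\,B^{24} + 0.2542\,B^{48})\,(1 - B)\,(1 - B^{24})\, y_t
  = (1 - 0.9490\,B)\,\varepsilon_t,
\qquad \operatorname{Var}(\varepsilon_t) \approx 140.8
```

The factors (1 - B) and (1 - B^{24}) are the first-order and the seasonal (24-hour) differences.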

  1. Accuracy. Let’s measure accuracy with RMSE on the training data and on the testing data (prediction for the last 7 days):

RMSE on training data:

[1] 11.85812

RMSE on testing data:

[1] 41.77625
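
RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch with hypothetical numbers (not the report's data):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical hourly order counts and forecasts.
actual = [10, 20, 30]
predicted = [12, 18, 33]
print(rmse(actual, predicted))
```

The training RMSE of 11.86 is close to the square root of the estimated sigma^2 (sqrt(140.8) is about 11.87), as expected for in-sample one-step errors.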

  1. Plot of the prediction for the last week:

  2. Disadvantages of the ARIMA model: