Library

library(dplyr)          # data wrangling
library(lubridate)      # date functions
library(xts)            # time series object
library(padr)           # padding time series
library(forecast)       # forecasting time series
library(tseries)        # should be updated to cocomelon :)
library(ggplot2)        # graphing

Problem Statement

Have you ever watched an in-game advertisement to collect diamonds or to revive and try again? Yes, I’m talking about mobile game ads. If so, you are not alone. It turns out that people tend to watch mobile game ads more in specific hours throughout the day. In this project, we will try to forecast this behavior. Specifically, we will look into ads watched per hour data from a mobile game, and try to predict how many ads will be watched for the next 24 hours.

Dataset

Let’s read the dataset.

ads <- read.csv('ads.csv')
head(ads)
#>                  Time    Ads
#> 1 2017-09-13T00:00:00  80115
#> 2 2017-09-13T01:00:00  79885
#> 3 2017-09-13T02:00:00  89325
#> 4 2017-09-13T03:00:00 101930
#> 5 2017-09-13T04:00:00 121630
#> 6 2017-09-13T05:00:00 116475
range(ads$Time)
#> [1] "2017-09-13T00:00:00" "2017-09-21T23:00:00"
range(ads$Ads)
#> [1]  70335 169900

We have an hourly time series data from 2017-09-13 00:00:00 to 2017-09-21 23:00:00 of how many ads were watched in a mobile game, ranging from around 70,000 to 170,000 ads.

Data Cleaning

Time series is a series of data points indexed (or listed or graphed) in time order. Therefore, the data is organized by relatively deterministic timestamps, and may - compared to random sampled data - contain additional information that we can extract.

For starters, we can mutate Time feature from character to time format, pad to even up the time period, then check for missing values.

ads <- ads %>% 
  mutate(Time = ymd_hms(Time)) %>% 
  arrange(Time) %>% 
  pad()

ads %>% is.na() %>% colSums()
#> Time  Ads 
#>    0    0

Great! No missing values.

Now, we need to create a time series object from the data. Since our data is hourly, we can do this using xts() function. Order the data by time_index, which is just an hourly index like Time feature in the original data. Also, we need to set the frequency to 24 since in every 24 timestamps the data has a periodic manner.

time_index <- seq(
  from = as.POSIXct("2017-09-13 00:00:00"), 
  to = as.POSIXct("2017-09-21 23:00:00"), 
  by = "hour"
)

ads_xts <- xts(ads$Ads, order.by = time_index, frequency = 24)
ads_xts %>% autoplot() + ggtitle("Ads Watched per Hour")

Since time series object originated from xts() can’t be easily decomposed, from this point on we will convert it to another time series object using ts() function. As a result, we can apply many built-in functions (such as decompose()) even though as a drawback, the time index can’t be in hourly period. The author would really appreciate if there exists a simpler solution using time series originated from xts().

ads_ts <- ts(ads_xts, frequency = 24)

Decomposition

A time series is basically a composition of 3 basic time series:

  1. trend: an overall direction of the data, whether it is flat, increasing, or decreasing. If there is still a fluctuation in the trend, that means there is some pattern in data not accounted in the decompositions.
  2. seasonal: periodic pattern that occurs within fixed timestamp.
  3. remainder: any pattern not accounted in trend and seasonal.

What we mean by composition here is just an addition or multiplication of these three decompositions. An additive type time series has relatively constant variance all the time, whereas the multiplicative counterpart has a tendency of increasing or decreasing variance. As an illustration, please take a look at the picture below.

A multiplicative time series can be log-transformed to create an additive time series. Clearly, our data is already an additive time series. Let’s observe its decompositions.

dim(ads_ts) <- NULL
ads_ts %>% decompose() %>% autoplot()

It’s apparent that we have a fairly flat trend (no trend). However, there’s a slight bump in the trend curve on 16-17 September 2017. This is because those two days are weekends and people seem to play mobile games more on weekends hence more ads watched. Note that we lost some information as much as two halves of a day each at the beginning and the end of the trend curve due to central moving average implemented in calculating it.

It’s also clear that we have a 24-hour seasonality. Let’s zoom in a little.

window(
  x = ads_xts, 
  start = as.POSIXct("2017-09-13 00:00:00"), 
  end = as.POSIXct("2017-09-13 23:00:00")
) %>% autoplot() + ggtitle("Ads Watched per Hour on 13 September 2017")

The plot above is our time series data specifically on a full day of 13 September 2017 (representative for analysis purpose). The number of ads watched at night is approximately 80k. This number increased and reached its peak around 120k at 04:00. Perhaps some morning gamers reached their phone and wanted to catch on some in-game events? Then, the number decreased during morning routines and increased to its second peak around 160k at 12:00 during lunch. After that, it seems that some people is done for the day and continue playing while others are back to work indicating a slight decrease until 16:00. The number increased again to reach its third and last peak for the day 160k-170k at 18:00-19:00 (at the after work hours) before plummeted down back to around 80k at night.

Lastly, the remainder of our data is pretty small (which is good) except for some spikes on 16-17 September 2017. Again, this may come from irregularities during weekends.

Metrics and Validation

We will use MAPE (Mean Absolute Percentage Error) as our metric since it’s not dependent on the magnitude of the data, only on percentage to the original data, which makes it convenient to compare the performance of different models. Also, we will separate the data into train and test as usual for model validation. The test data will be the last 24 hours that we need to forecast as the problem statement inquired us to.

test_num <- 24
test <- tail(ads_ts, test_num)
train <- head(ads_ts, length(ads_ts)-length(test))

Modeling

We have several simple models to choose from together with which time series data they are suitable for:

  1. Simple Moving Average (SMA) \(\to\) no trend and seasonal
  2. Exponential Smoothing
    • Simple Exponential Smoothing (SES) \(\to\) no trend and seasonal
    • Double Exponential Smoothing (Holt Exponential Smoothing) \(\to\) with trend no seasonal
    • Triple Exponential Smoothing (Holt-Winters) \(\to\) with trend and seasonal
  3. Autoregressive Integrated Moving Average (ARIMA) \(\to\) with trend no seasonal
  4. Seasonal ARIMA (SARIMA) \(\to\) with trend and seasonal

Since we have seasonal data, we will use Holt-Winters and SARIMA.

Holt-Winters Exponential

First, build the model. Holt-Winters model automatically gives us smoothing parameters \(\alpha=0.92\), \(\beta=0\), and \(\gamma=1\) for this problem. We can manually tune these parameters, but for now we will stick to what’s given.

hw_model <- HoltWinters(x = train)
hw_model
#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#> 
#> Call:
#> HoltWinters(x = train)
#> 
#> Smoothing parameters:
#>  alpha: 0.9177049
#>  beta : 0
#>  gamma: 1
#> 
#> Coefficients:
#>            [,1]
#> a   116134.6370
#> b     -112.5410
#> s1  -42551.9149
#> s2  -46114.6303
#> s3  -38622.4558
#> s4  -23062.7216
#> s5     766.4552
#> s6   -2643.0073
#> s7  -10590.9356
#> s8  -16361.4525
#> s9  -18599.8027
#> s10  -9730.1904
#> s11   6773.7268
#> s12  23255.7037
#> s13  32034.0766
#> s14  29038.2208
#> s15  27572.8383
#> s16  26323.7715
#> s17  24075.6307
#> s18  26062.1503
#> s19  36996.4620
#> s20  39800.2418
#> s21  16732.8980
#> s22 -12780.5149
#> s23 -27289.9691
#> s24 -40084.6370

Now, forecast time…

hw_forecast <- forecast(hw_model, h = test_num)
hw_forecast
#>          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
#> 9.000000       73470.18  64566.94  82373.42  59853.85  87086.51
#> 9.041667       69794.92  57710.82  81879.03  51313.88  88275.97
#> 9.083333       77174.56  62587.46  91761.66  54865.52  99483.60
#> 9.125000       92621.75  75902.26 109341.24  67051.50 118192.00
#> 9.166667      116338.39  97729.27 134947.50  87878.20 144798.57
#> 9.208333      112816.38  92492.58 133140.19  81733.81 143898.96
#> 9.250000      104755.91  82851.24 126660.59  71255.60 138256.23
#> 9.291667       98872.86  75493.96 122251.75  63117.92 134627.79
#> 9.333333       96521.97  71756.45 121287.48  58646.38 134397.55
#> 9.375000      105279.04  79200.53 131357.54  65395.40 145162.67
#> 9.416667      121670.41  94341.92 148998.90  79875.09 163465.73
#> 9.458333      138039.85 109516.10 166563.60  94416.54 181663.16
#> 9.500000      146705.68 117034.78 176376.58 101327.96 192083.40
#> 9.541667      143597.28 112821.97 174372.60  96530.50 190664.06
#> 9.583333      142019.36 110177.91 173860.81  93322.07 190716.65
#> 9.625000      140657.75 107784.73 173530.78  90382.80 190932.70
#> 9.666667      138297.07 104423.87 172170.27  86492.48 190101.66
#> 9.708333      140171.05 105326.37 175015.72  86880.72 193461.38
#> 9.750000      150992.82 115203.03 186782.61  96257.06 205728.58
#> 9.791667      153684.06 116973.47 190394.64  97540.06 209828.05
#> 9.833333      130504.17  92895.33 168113.01  72986.41 188021.93
#> 9.875000      100878.22  62392.08 139364.36  42018.75 159737.69
#> 9.916667       86256.22  46912.35 125600.10  26084.96 146427.49
#> 9.958333       73349.02  33165.70 113532.33  11893.94 134804.09

…and plot.

train %>% 
  autoplot(series = "train") +
  autolayer(test, series = "test") +
  autolayer(hw_model$fitted[,1], series = "fitted") +
  autolayer(hw_forecast$mean, series = "forecast") + 
  ggtitle("Holt-Winters Prediction")

hw_mape <- cbind(
  'train_mape' = accuracy(hw_model$fitted[,1], train)[, 'MAPE'],
  'test_mape' = accuracy(hw_forecast$mean, test)[, 'MAPE']
)

hw_mape <- as.data.frame(hw_mape)
rownames(hw_mape) <- c('Holt-Winters')
hw_mape
#>              train_mape test_mape
#> Holt-Winters   3.800135  3.691772

From the above plot and MAPE score, we actually have descent result from Holt-Winters model, except for some after-lunch working hours. But we can’t really blame the model here since the data shows pattern irregularities during these hours.

SARIMA

Using SARIMA comes with complication: we have to determine many parameters and build the model based on the chosen parameters, assuming the data is stationary. To assert this stationary assumption, we can use what’s called Augmented Dickey-Fuller Hypothesis Test (or ADF test for short).

H0 : data is not stationary H1 : data is stationary

We wish p-value < alpha (0.05) so that H0 is rejected and the data is stationary.

adf.test(train)
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  train
#> Dickey-Fuller = -4.9338, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
train %>% 
  tsdisplay()

Based on ADF test above, the data is already stationary. This is no surprise since the data does not change its statistical properties over time, namely its mean and variance. We can see this on the plot itself: no visible trend (so the mean is constant) and the variance is pretty much stable. The only thing left is seasonality, which we have to deal with prior to modeling. To do so, let’s take the seasonal difference, which means a simple subtraction of the series from itself with a lag that equals the seasonal period.

train %>% 
  diff(lag = 24) %>%
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -2.4897, Lag order = 5, p-value = 0.3719
#> alternative hypothesis: stationary
train %>% 
  diff(lag = 24) %>%
  tsdisplay()

With the seasonality gone, the data becomes non-stationary (p-value > 0.05). Also, the ACF graph still shows many significant lags. To overcome this problem, we’ll take the first difference, subtracting the series from itself with lag 1.

train %>% 
  diff(lag = 24) %>%
  diff() %>%
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -6.4821, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
train %>% 
  diff(lag = 24) %>%
  diff() %>%
  tsdisplay()

Now we’re in business! Let’s determine the parameters.

  1. Since we are doing seasonal difference once and first difference, then \(d=1\) and \(D=1\).
  2. From the PACF and ACF plot, we see that lags 3 and 4 are significant, so \(p=3\) or \(4\) and \(q=3\) or \(4\).
  3. From the PACF plot, lags 24 and 48 are significant, so \(P=1\) or \(2\).
  4. From the ACF plot, lag 24 is significant but lag 48 isn’t, so \(Q=0\) or \(1\).

To summarize,

p = 3 or 4 d = 1 q = 3 or 4 P = 1 or 2 D = 1 Q = 0 or 1

Based on these parameters, we will make 4 SARIMA models as follows.

sarima1_model <- Arima(y = train, order = c(4,1,4), seasonal = c(2,1,1))
sarima2_model <- Arima(y = train, order = c(4,1,4), seasonal = c(1,1,1))
sarima3_model <- Arima(y = train, order = c(3,1,3), seasonal = c(1,1,1))
sarima4_model <- Arima(y = train, order = c(3,1,3), seasonal = c(1,1,0))

Take a look at the AIC of each model. Remember, AIC is a measure of information loss, which makes sarima3_model is the best one for the moment. However, lower AIC doesn’t mean the model performs better on test dataset.

sarima1_model$aic
#> [1] 3383.505
sarima2_model$aic
#> [1] 3382.48
sarima3_model$aic
#> [1] 3379.964
sarima4_model$aic
#> [1] 3410.026

As usual, forecast time…

sarima1_forecast <- forecast(sarima1_model, h = test_num)
sarima2_forecast <- forecast(sarima2_model, h = test_num)
sarima3_forecast <- forecast(sarima3_model, h = test_num)
sarima4_forecast <- forecast(sarima4_model, h = test_num)

…and plot.

train %>% 
  autoplot(series = "train") +
  autolayer(test, series = "test") +
  autolayer(sarima1_model$fitted, series = "fitted") +
  autolayer(sarima1_forecast$mean, series = "forecast") +
  ggtitle("ARIMA(4,1,4)(2,1,1)[24] Prediction")

train %>% 
  autoplot(series = "train") +
  autolayer(test, series = "test") +
  autolayer(sarima2_model$fitted, series = "fitted") +
  autolayer(sarima2_forecast$mean, series = "forecast") +
  ggtitle("ARIMA(4,1,4)(1,1,1)[24] Prediction")

train %>% 
  autoplot(series = "train") +
  autolayer(test, series = "test") +
  autolayer(sarima3_model$fitted, series = "fitted") +
  autolayer(sarima3_forecast$mean, series = "forecast") +
  ggtitle("ARIMA(3,1,3)(1,1,1)[24] Prediction")

train %>% 
  autoplot(series = "train") +
  autolayer(test, series = "test") +
  autolayer(sarima4_model$fitted, series = "fitted") +
  autolayer(sarima4_forecast$mean, series = "forecast") +
  ggtitle("ARIMA(3,1,3)(1,1,0)[24] Prediction")

There are only slight differences from all four models. Just like Holt-Winters, the models struggle to forecast after-lunch working hours, even fail to forecast the second peak at 12:00. We can also see that sarima4_model is the best in forecasting morning hours.

train_mape <- rbind(
  'ARIMA(4,1,4)(2,1,1)[24]' = accuracy(sarima1_model)[, 'MAPE'],
  'ARIMA(4,1,4)(1,1,1)[24]' = accuracy(sarima2_model)[, 'MAPE'],
  'ARIMA(3,1,3)(1,1,1)[24]' = accuracy(sarima3_model)[, 'MAPE'],
  'ARIMA(3,1,3)(1,1,0)[24]' = accuracy(sarima4_model)[, 'MAPE']
)

test_mape <- rbind(
  accuracy(sarima1_forecast$mean, test)[, 'MAPE'],
  accuracy(sarima2_forecast$mean, test)[, 'MAPE'],
  accuracy(sarima3_forecast$mean, test)[, 'MAPE'],
  accuracy(sarima4_forecast$mean, test)[, 'MAPE']
)

sarima_mape <- cbind(train_mape, test_mape)
sarima_mape <- as.data.frame(sarima_mape)
colnames(sarima_mape) <- c('train_mape', 'test_mape')
result <- rbind(hw_mape, sarima_mape)
result
#>                         train_mape test_mape
#> Holt-Winters              3.800135  3.691772
#> ARIMA(4,1,4)(2,1,1)[24]   2.628641  4.185137
#> ARIMA(4,1,4)(1,1,1)[24]   2.680071  4.013859
#> ARIMA(3,1,3)(1,1,1)[24]   2.711371  3.995025
#> ARIMA(3,1,3)(1,1,0)[24]   3.341027  3.758555

The first three models of SARIMA indicate overfitting to some extend. On the other hand, the last model is definitely better than Holt-Winters on train dataset but slightly worse than Holt-Winters on test dataset.

Auto ARIMA

We can actually automate the parameter search of SARIMA model. However, the result will not always be better than manual tuning, there is some luck taking part here. Let’s get right into it.

auto_arima_model <- auto.arima(train)
auto_arima_model
#> Series: train 
#> ARIMA(3,0,3)(1,1,0)[24] 
#> 
#> Coefficients:
#>          ar1      ar2     ar3      ma1     ma2      ma3     sar1
#>       1.5185  -1.0912  0.4706  -0.5546  0.6366  -0.3294  -0.3786
#> s.e.  0.2028   0.2050  0.1065   0.2012  0.1330   0.1084   0.0756
#> 
#> sigma^2 estimated as 38627419:  log likelihood=-1705.08
#> AIC=3426.17   AICc=3427.07   BIC=3451.16

We got ARIMA(3,0,3)(1,1,0)[24] model which is very close to our best SARIMA model from before, with the discrepancy only on the \(d\) parameter. This new model doesn’t consider differencing by 1 lag as the earlier process suggested. The AIC is also the worst among all models encountered so far. However, let’s continue and see what happens.

auto_arima_forecast <- forecast(auto_arima_model, h = test_num)
train %>% 
  autoplot(series = "train") +
  autolayer(test, series = "test") +
  autolayer(auto_arima_model$fitted, series = "fitted") +
  autolayer(auto_arima_forecast$mean, series = "forecast") + 
  ggtitle("Auto ARIMA Prediction")

auto_arima_mape <- cbind(
  'train_mape' = accuracy(auto_arima_model)[, 'MAPE'],
  'test_mape' = accuracy(auto_arima_forecast$mean, test)[, 'MAPE']
)

auto_arima_mape <- as.data.frame(auto_arima_mape)
rownames(auto_arima_mape) <- c('Auto ARIMA')
result <- rbind(result, auto_arima_mape)
result
#>                         train_mape test_mape
#> Holt-Winters              3.800135  3.691772
#> ARIMA(4,1,4)(2,1,1)[24]   2.628641  4.185137
#> ARIMA(4,1,4)(1,1,1)[24]   2.680071  4.013859
#> ARIMA(3,1,3)(1,1,1)[24]   2.711371  3.995025
#> ARIMA(3,1,3)(1,1,0)[24]   3.341027  3.758555
#> Auto ARIMA                3.332454  2.890477

We’ve found the winner! Not only performs well on train dataset, Auto ARIMA crashed test dataset as well. This is the best model we made so far.

Assumptions

We will only test the assumptions of the best model (Auto ARIMA). Assumptions on time series data are tested to determine whether the residual obtained from modeling is good enough to explain that the trend and seasonal compositions already captured as much information as possible. The best model is the one with no autocorrelated residuals and the residuals are normally distributed.

No-autocorrelated residuals

If the residuals are autocorrelated, then there is still some information left behind that should be used to further improve the model. To see this, we can plot the ACF and PACF plots of residuals and see if there are still significant lags.

auto_arima_model$residuals %>% 
  tsdisplay()

As we can see, there are no significant lags left suggested in the ACF and PACF plot. Wait, there are actually some: lag 42 in the ACF plot and lag 27 in the PACF plot. However, these lags are too far away from the present and hence their significance is negligible.

If it’s hard to test the assumption from the plot, there is a more objective way to do so: the Box-Ljung test.

H0: residuals have no-autocorrelation H1: residuals have autocorrelation

We want p-value > alpha (0.05) so that H0 fails to be rejected and residuals have no autocorrelation.

Box.test(auto_arima_model$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  auto_arima_model$residuals
#> X-squared = 0.044539, df = 1, p-value = 0.8329

As we can see, p-value > 0.05, which confirms the previous observation from the ACF and PACF plot.

Normally distributed residuals

To check this, we can observe from the histogram of residuals. Are they normally distributed around zero?

hist(auto_arima_model$residuals)

If it’s hard to tell, Shapiro–Wilk test is always our friend.

H0: residuals are normally distributed H1: residuals are not normally distributed

We want p-value > alpha (0.05) so that H0 fails to be rejected and the residuals are normally distributed.

shapiro.test(auto_arima_model$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  auto_arima_model$residuals
#> W = 0.94388, p-value = 0.0000008032

Since p-value is below alpha (0.05), reject H0. Hence, residuals are not normally distributed.

Conclusion

result
#>                         train_mape test_mape
#> Holt-Winters              3.800135  3.691772
#> ARIMA(4,1,4)(2,1,1)[24]   2.628641  4.185137
#> ARIMA(4,1,4)(1,1,1)[24]   2.680071  4.013859
#> ARIMA(3,1,3)(1,1,1)[24]   2.711371  3.995025
#> ARIMA(3,1,3)(1,1,0)[24]   3.341027  3.758555
#> Auto ARIMA                3.332454  2.890477

We’ve implemented time series forecasting on hourly mobile game ads dataset. Since the dataset has seasonality, we use Holt-Winters and SARIMA model. We see that Auto ARIMA with parameters ARIMA(3,0,3)(1,1,0)[24] is the best model with 3.33% MAPE on train dataset and 2.89% MAPE on test dataset. However, this model violates normally distributed residuals assumption of time series data.