Have you ever watched an in-game advertisement to collect diamonds or to revive and try again? Yes, I’m talking about mobile game ads. If so, you are not alone. It turns out that people tend to watch mobile game ads more in specific hours throughout the day. In this project, we will try to forecast this behavior. Specifically, we will look into ads watched per hour data from a mobile game, and try to predict how many ads will be watched for the next 24 hours.
Let’s read the dataset.
#> Time Ads
#> 1 2017-09-13T00:00:00 80115
#> 2 2017-09-13T01:00:00 79885
#> 3 2017-09-13T02:00:00 89325
#> 4 2017-09-13T03:00:00 101930
#> 5 2017-09-13T04:00:00 121630
#> 6 2017-09-13T05:00:00 116475
#> [1] "2017-09-13T00:00:00" "2017-09-21T23:00:00"
#> [1] 70335 169900
We have an hourly time series data from 2017-09-13 00:00:00 to 2017-09-21 23:00:00 of how many ads were watched in a mobile game, ranging from around 70,000 to 170,000 ads.
Time series is a series of data points indexed (or listed or graphed) in time order. Therefore, the data is organized by relatively deterministic timestamps, and may - compared to random sampled data - contain additional information that we can extract.
For starters, we can mutate Time feature from character to time format, pad to even up the time period, then check for missing values.
ads <- ads %>%
mutate(Time = ymd_hms(Time)) %>%
arrange(Time) %>%
pad()
ads %>% is.na() %>% colSums()#> Time Ads
#> 0 0
Great! No missing values.
Now, we need to create a time series object from the data. Since our data is hourly, we can do this using xts() function. Order the data by time_index, which is just an hourly index like Time feature in the original data. Also, we need to set the frequency to 24 since in every 24 timestamps the data has a periodic manner.
time_index <- seq(
from = as.POSIXct("2017-09-13 00:00:00"),
to = as.POSIXct("2017-09-21 23:00:00"),
by = "hour"
)
ads_xts <- xts(ads$Ads, order.by = time_index, frequency = 24)
ads_xts %>% autoplot() + ggtitle("Ads Watched per Hour")Since time series object originated from xts() can’t be easily decomposed, from this point on we will convert it to another time series object using ts() function. As a result, we can apply many built-in functions (such as decompose()) even though as a drawback, the time index can’t be in hourly period. The author would really appreciate if there exists a simpler solution using time series originated from xts().
A time series is basically a composition of 3 basic time series:
What we mean by composition here is just an addition or multiplication of these three decompositions. An additive type time series has relatively constant variance all the time, whereas the multiplicative counterpart has a tendency of increasing or decreasing variance. As an illustration, please take a look at the picture below.
A multiplicative time series can be log-transformed to create an additive time series. Clearly, our data is already an additive time series. Let’s observe its decompositions.
It’s apparent that we have a fairly flat trend (no trend). However, there’s a slight bump in the trend curve on 16-17 September 2017. This is because those two days are weekends and people seem to play mobile games more on weekends hence more ads watched. Note that we lost some information as much as two halves of a day each at the beginning and the end of the trend curve due to central moving average implemented in calculating it.
It’s also clear that we have a 24-hour seasonality. Let’s zoom in a little.
window(
x = ads_xts,
start = as.POSIXct("2017-09-13 00:00:00"),
end = as.POSIXct("2017-09-13 23:00:00")
) %>% autoplot() + ggtitle("Ads Watched per Hour on 13 September 2017")The plot above is our time series data specifically on a full day of 13 September 2017 (representative for analysis purpose). The number of ads watched at night is approximately 80k. This number increased and reached its peak around 120k at 04:00. Perhaps some morning gamers reached their phone and wanted to catch on some in-game events? Then, the number decreased during morning routines and increased to its second peak around 160k at 12:00 during lunch. After that, it seems that some people is done for the day and continue playing while others are back to work indicating a slight decrease until 16:00. The number increased again to reach its third and last peak for the day 160k-170k at 18:00-19:00 (at the after work hours) before plummeted down back to around 80k at night.
Lastly, the remainder of our data is pretty small (which is good) except for some spikes on 16-17 September 2017. Again, this may come from irregularities during weekends.
We will use MAPE (Mean Absolute Percentage Error) as our metric since it’s not dependent on the magnitude of the data, only on percentage to the original data, which makes it convenient to compare the performance of different models. Also, we will separate the data into train and test as usual for model validation. The test data will be the last 24 hours that we need to forecast as the problem statement inquired us to.
We have several simple models to choose from together with which time series data they are suitable for:
Since we have seasonal data, we will use Holt-Winters and SARIMA.
First, build the model. Holt-Winters model automatically gives us smoothing parameters \(\alpha=0.92\), \(\beta=0\), and \(\gamma=1\) for this problem. We can manually tune these parameters, but for now we will stick to what’s given.
#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#>
#> Call:
#> HoltWinters(x = train)
#>
#> Smoothing parameters:
#> alpha: 0.9177049
#> beta : 0
#> gamma: 1
#>
#> Coefficients:
#> [,1]
#> a 116134.6370
#> b -112.5410
#> s1 -42551.9149
#> s2 -46114.6303
#> s3 -38622.4558
#> s4 -23062.7216
#> s5 766.4552
#> s6 -2643.0073
#> s7 -10590.9356
#> s8 -16361.4525
#> s9 -18599.8027
#> s10 -9730.1904
#> s11 6773.7268
#> s12 23255.7037
#> s13 32034.0766
#> s14 29038.2208
#> s15 27572.8383
#> s16 26323.7715
#> s17 24075.6307
#> s18 26062.1503
#> s19 36996.4620
#> s20 39800.2418
#> s21 16732.8980
#> s22 -12780.5149
#> s23 -27289.9691
#> s24 -40084.6370
Now, forecast time…
#> Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#> 9.000000 73470.18 64566.94 82373.42 59853.85 87086.51
#> 9.041667 69794.92 57710.82 81879.03 51313.88 88275.97
#> 9.083333 77174.56 62587.46 91761.66 54865.52 99483.60
#> 9.125000 92621.75 75902.26 109341.24 67051.50 118192.00
#> 9.166667 116338.39 97729.27 134947.50 87878.20 144798.57
#> 9.208333 112816.38 92492.58 133140.19 81733.81 143898.96
#> 9.250000 104755.91 82851.24 126660.59 71255.60 138256.23
#> 9.291667 98872.86 75493.96 122251.75 63117.92 134627.79
#> 9.333333 96521.97 71756.45 121287.48 58646.38 134397.55
#> 9.375000 105279.04 79200.53 131357.54 65395.40 145162.67
#> 9.416667 121670.41 94341.92 148998.90 79875.09 163465.73
#> 9.458333 138039.85 109516.10 166563.60 94416.54 181663.16
#> 9.500000 146705.68 117034.78 176376.58 101327.96 192083.40
#> 9.541667 143597.28 112821.97 174372.60 96530.50 190664.06
#> 9.583333 142019.36 110177.91 173860.81 93322.07 190716.65
#> 9.625000 140657.75 107784.73 173530.78 90382.80 190932.70
#> 9.666667 138297.07 104423.87 172170.27 86492.48 190101.66
#> 9.708333 140171.05 105326.37 175015.72 86880.72 193461.38
#> 9.750000 150992.82 115203.03 186782.61 96257.06 205728.58
#> 9.791667 153684.06 116973.47 190394.64 97540.06 209828.05
#> 9.833333 130504.17 92895.33 168113.01 72986.41 188021.93
#> 9.875000 100878.22 62392.08 139364.36 42018.75 159737.69
#> 9.916667 86256.22 46912.35 125600.10 26084.96 146427.49
#> 9.958333 73349.02 33165.70 113532.33 11893.94 134804.09
…and plot.
train %>%
autoplot(series = "train") +
autolayer(test, series = "test") +
autolayer(hw_model$fitted[,1], series = "fitted") +
autolayer(hw_forecast$mean, series = "forecast") +
ggtitle("Holt-Winters Prediction")hw_mape <- cbind(
'train_mape' = accuracy(hw_model$fitted[,1], train)[, 'MAPE'],
'test_mape' = accuracy(hw_forecast$mean, test)[, 'MAPE']
)
hw_mape <- as.data.frame(hw_mape)
rownames(hw_mape) <- c('Holt-Winters')
hw_mape#> train_mape test_mape
#> Holt-Winters 3.800135 3.691772
From the above plot and MAPE score, we actually have descent result from Holt-Winters model, except for some after-lunch working hours. But we can’t really blame the model here since the data shows pattern irregularities during these hours.
Using SARIMA comes with complication: we have to determine many parameters and build the model based on the chosen parameters, assuming the data is stationary. To assert this stationary assumption, we can use what’s called Augmented Dickey-Fuller Hypothesis Test (or ADF test for short).
H0 : data is not stationary H1 : data is stationary
We wish p-value < alpha (0.05) so that H0 is rejected and the data is stationary.
#>
#> Augmented Dickey-Fuller Test
#>
#> data: train
#> Dickey-Fuller = -4.9338, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
Based on ADF test above, the data is already stationary. This is no surprise since the data does not change its statistical properties over time, namely its mean and variance. We can see this on the plot itself: no visible trend (so the mean is constant) and the variance is pretty much stable. The only thing left is seasonality, which we have to deal with prior to modeling. To do so, let’s take the seasonal difference, which means a simple subtraction of the series from itself with a lag that equals the seasonal period.
#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -2.4897, Lag order = 5, p-value = 0.3719
#> alternative hypothesis: stationary
With the seasonality gone, the data becomes non-stationary (p-value > 0.05). Also, the ACF graph still shows many significant lags. To overcome this problem, we’ll take the first difference, subtracting the series from itself with lag 1.
#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -6.4821, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
Now we’re in business! Let’s determine the parameters.
To summarize,
p = 3 or 4 d = 1 q = 3 or 4 P = 1 or 2 D = 1 Q = 0 or 1
Based on these parameters, we will make 4 SARIMA models as follows.
sarima1_model <- Arima(y = train, order = c(4,1,4), seasonal = c(2,1,1))
sarima2_model <- Arima(y = train, order = c(4,1,4), seasonal = c(1,1,1))
sarima3_model <- Arima(y = train, order = c(3,1,3), seasonal = c(1,1,1))
sarima4_model <- Arima(y = train, order = c(3,1,3), seasonal = c(1,1,0))Take a look at the AIC of each model. Remember, AIC is a measure of information loss, which makes sarima3_model is the best one for the moment. However, lower AIC doesn’t mean the model performs better on test dataset.
#> [1] 3383.505
#> [1] 3382.48
#> [1] 3379.964
#> [1] 3410.026
As usual, forecast time…
sarima1_forecast <- forecast(sarima1_model, h = test_num)
sarima2_forecast <- forecast(sarima2_model, h = test_num)
sarima3_forecast <- forecast(sarima3_model, h = test_num)
sarima4_forecast <- forecast(sarima4_model, h = test_num)…and plot.
train %>%
autoplot(series = "train") +
autolayer(test, series = "test") +
autolayer(sarima1_model$fitted, series = "fitted") +
autolayer(sarima1_forecast$mean, series = "forecast") +
ggtitle("ARIMA(4,1,4)(2,1,1)[24] Prediction")train %>%
autoplot(series = "train") +
autolayer(test, series = "test") +
autolayer(sarima2_model$fitted, series = "fitted") +
autolayer(sarima2_forecast$mean, series = "forecast") +
ggtitle("ARIMA(4,1,4)(1,1,1)[24] Prediction")train %>%
autoplot(series = "train") +
autolayer(test, series = "test") +
autolayer(sarima3_model$fitted, series = "fitted") +
autolayer(sarima3_forecast$mean, series = "forecast") +
ggtitle("ARIMA(3,1,3)(1,1,1)[24] Prediction")train %>%
autoplot(series = "train") +
autolayer(test, series = "test") +
autolayer(sarima4_model$fitted, series = "fitted") +
autolayer(sarima4_forecast$mean, series = "forecast") +
ggtitle("ARIMA(3,1,3)(1,1,0)[24] Prediction")There are only slight differences from all four models. Just like Holt-Winters, the models struggle to forecast after-lunch working hours, even fail to forecast the second peak at 12:00. We can also see that sarima4_model is the best in forecasting morning hours.
train_mape <- rbind(
'ARIMA(4,1,4)(2,1,1)[24]' = accuracy(sarima1_model)[, 'MAPE'],
'ARIMA(4,1,4)(1,1,1)[24]' = accuracy(sarima2_model)[, 'MAPE'],
'ARIMA(3,1,3)(1,1,1)[24]' = accuracy(sarima3_model)[, 'MAPE'],
'ARIMA(3,1,3)(1,1,0)[24]' = accuracy(sarima4_model)[, 'MAPE']
)
test_mape <- rbind(
accuracy(sarima1_forecast$mean, test)[, 'MAPE'],
accuracy(sarima2_forecast$mean, test)[, 'MAPE'],
accuracy(sarima3_forecast$mean, test)[, 'MAPE'],
accuracy(sarima4_forecast$mean, test)[, 'MAPE']
)
sarima_mape <- cbind(train_mape, test_mape)
sarima_mape <- as.data.frame(sarima_mape)
colnames(sarima_mape) <- c('train_mape', 'test_mape')
result <- rbind(hw_mape, sarima_mape)
result#> train_mape test_mape
#> Holt-Winters 3.800135 3.691772
#> ARIMA(4,1,4)(2,1,1)[24] 2.628641 4.185137
#> ARIMA(4,1,4)(1,1,1)[24] 2.680071 4.013859
#> ARIMA(3,1,3)(1,1,1)[24] 2.711371 3.995025
#> ARIMA(3,1,3)(1,1,0)[24] 3.341027 3.758555
The first three models of SARIMA indicate overfitting to some extend. On the other hand, the last model is definitely better than Holt-Winters on train dataset but slightly worse than Holt-Winters on test dataset.
We can actually automate the parameter search of SARIMA model. However, the result will not always be better than manual tuning, there is some luck taking part here. Let’s get right into it.
#> Series: train
#> ARIMA(3,0,3)(1,1,0)[24]
#>
#> Coefficients:
#> ar1 ar2 ar3 ma1 ma2 ma3 sar1
#> 1.5185 -1.0912 0.4706 -0.5546 0.6366 -0.3294 -0.3786
#> s.e. 0.2028 0.2050 0.1065 0.2012 0.1330 0.1084 0.0756
#>
#> sigma^2 estimated as 38627419: log likelihood=-1705.08
#> AIC=3426.17 AICc=3427.07 BIC=3451.16
We got ARIMA(3,0,3)(1,1,0)[24] model which is very close to our best SARIMA model from before, with the discrepancy only on the \(d\) parameter. This new model doesn’t consider differencing by 1 lag as the earlier process suggested. The AIC is also the worst among all models encountered so far. However, let’s continue and see what happens.
train %>%
autoplot(series = "train") +
autolayer(test, series = "test") +
autolayer(auto_arima_model$fitted, series = "fitted") +
autolayer(auto_arima_forecast$mean, series = "forecast") +
ggtitle("Auto ARIMA Prediction")auto_arima_mape <- cbind(
'train_mape' = accuracy(auto_arima_model)[, 'MAPE'],
'test_mape' = accuracy(auto_arima_forecast$mean, test)[, 'MAPE']
)
auto_arima_mape <- as.data.frame(auto_arima_mape)
rownames(auto_arima_mape) <- c('Auto ARIMA')
result <- rbind(result, auto_arima_mape)
result#> train_mape test_mape
#> Holt-Winters 3.800135 3.691772
#> ARIMA(4,1,4)(2,1,1)[24] 2.628641 4.185137
#> ARIMA(4,1,4)(1,1,1)[24] 2.680071 4.013859
#> ARIMA(3,1,3)(1,1,1)[24] 2.711371 3.995025
#> ARIMA(3,1,3)(1,1,0)[24] 3.341027 3.758555
#> Auto ARIMA 3.332454 2.890477
We’ve found the winner! Not only performs well on train dataset, Auto ARIMA crashed test dataset as well. This is the best model we made so far.
We will only test the assumptions of the best model (Auto ARIMA). Assumptions on time series data are tested to determine whether the residual obtained from modeling is good enough to explain that the trend and seasonal compositions already captured as much information as possible. The best model is the one with no autocorrelated residuals and the residuals are normally distributed.
To check this, we can observe from the histogram of residuals. Are they normally distributed around zero?
If it’s hard to tell, Shapiro–Wilk test is always our friend.
H0: residuals are normally distributed H1: residuals are not normally distributed
We want p-value > alpha (0.05) so that H0 fails to be rejected and the residuals are normally distributed.
#>
#> Shapiro-Wilk normality test
#>
#> data: auto_arima_model$residuals
#> W = 0.94388, p-value = 0.0000008032
Since p-value is below alpha (0.05), reject H0. Hence, residuals are not normally distributed.
#> train_mape test_mape
#> Holt-Winters 3.800135 3.691772
#> ARIMA(4,1,4)(2,1,1)[24] 2.628641 4.185137
#> ARIMA(4,1,4)(1,1,1)[24] 2.680071 4.013859
#> ARIMA(3,1,3)(1,1,1)[24] 2.711371 3.995025
#> ARIMA(3,1,3)(1,1,0)[24] 3.341027 3.758555
#> Auto ARIMA 3.332454 2.890477
We’ve implemented time series forecasting on hourly mobile game ads dataset. Since the dataset has seasonality, we use Holt-Winters and SARIMA model. We see that Auto ARIMA with parameters ARIMA(3,0,3)(1,1,0)[24] is the best model with 3.33% MAPE on train dataset and 2.89% MAPE on test dataset. However, this model violates normally distributed residuals assumption of time series data.