Time-Series Analysis of Cinema Ticket Use and Total Sales

Calvin

1/23/2022

Background

The dataset comes from Kaggle: Cinema Tickets

Loading the Libraries

library(tidyverse)
library(lubridate)
library(forecast) #time series helpful functions
library(padr) #padding missing times
library(MLmetrics) # calculate error
library(tseries) # adf.test

Data Wrangling & EDA

cinema <- read.csv("cinemaTicket_Ref.csv")
glimpse(cinema)
#> Rows: 142,524
#> Columns: 14
#> $ film_code    <int> 1492, 1492, 1492, 1492, 1492, 1492, 1492, 1492, 1492, 149~
#> $ cinema_code  <int> 304, 352, 489, 429, 524, 71, 163, 450, 51, 522, 43, 529, ~
#> $ total_sales  <int> 3900000, 3360000, 2560000, 1200000, 1200000, 1050000, 102~
#> $ tickets_sold <int> 26, 42, 32, 12, 15, 7, 10, 5, 11, 4, 6, 4, 5, 2, 2, 112, ~
#> $ tickets_out  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ show_time    <int> 4, 5, 4, 1, 3, 3, 3, 3, 2, 3, 3, 3, 6, 3, 1, 4, 5, 3, 3, ~
#> $ occu_perc    <dbl> 4.26, 8.08, 20.00, 11.01, 16.67, 0.98, 7.69, 1.57, 0.95, ~
#> $ ticket_price <dbl> 150000.00, 80000.00, 80000.00, 100000.00, 80000.00, 15000~
#> $ ticket_use   <int> 26, 42, 32, 12, 15, 7, 10, 5, 11, 4, 6, 4, 5, 2, 2, 112, ~
#> $ capacity     <dbl> 610.32864, 519.80198, 160.00000, 108.99183, 89.98200, 714~
#> $ date         <chr> "2018-05-05", "2018-05-05", "2018-05-05", "2018-05-05", "~
#> $ month        <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ~
#> $ quarter      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
#> $ day          <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, ~
range(cinema$date)
#> [1] "2018-02-21" "2018-11-04"

Two variables in this dataset stand out as suitable for time-series analysis, since they can easily be aggregated across all cinemas and films: Ticket Use (the total number of tickets used, i.e. sold and not cancelled) and Total Sales (the total sales per screening time).

Observing Sum of Ticket Use

cinema_clean_ticketuse <- 
  cinema %>% 
  mutate(date = ymd(date)) %>% 
  group_by(date) %>% 
  summarise(ticket_use = sum(ticket_use)) %>% 
  arrange(date) %>% 
  pad() %>% 
  fill_by_value(value = 0)

head(cinema_clean_ticketuse,30)
#> # A tibble: 30 x 2
#>    date       ticket_use
#>    <date>          <int>
#>  1 2018-02-21        267
#>  2 2018-02-22          0
#>  3 2018-02-23          3
#>  4 2018-02-24          0
#>  5 2018-02-25          0
#>  6 2018-02-26          0
#>  7 2018-02-27          0
#>  8 2018-02-28          0
#>  9 2018-03-01          0
#> 10 2018-03-02          0
#> # ... with 20 more rows
range(cinema_clean_ticketuse$date)
#> [1] "2018-02-21" "2018-11-04"

The first 30 rows above show that the earliest dates are mostly missing values that we padded with 0, which would distort the analysis. Therefore we remove the first 21 observations before using the data later.

cinema_clean_ticketuse <- 
  cinema %>% 
  mutate(date = ymd(date)) %>% 
  group_by(date) %>% 
  summarise(ticket_use = sum(ticket_use)) %>% 
  arrange(date) %>% 
  pad() %>% 
  fill_by_value(value = 0) %>% 
  tail(-21)

Then let’s view the data as a whole.

ts(data = cinema_clean_ticketuse$ticket_use,
   start = min(cinema_clean_ticketuse$date),
   frequency = 365) %>% 
  autoplot()

Afterwards we will decompose the time-series data using both decompose() and stl(). The stl() function was only mentioned as additional information during class, but I remember that it is an alternative to decompose(): a Seasonal Decomposition of Time Series by Loess, which does not use the classic moving average to extract the trend, so more of the data is preserved.

We will be comparing decompose() vs stl() in the following autoplots.

Weekly

cinema_ts_ticket_weekly <- ts(data = cinema_clean_ticketuse$ticket_use,
                start = min(cinema_clean_ticketuse$date),
                frequency = 7)

cinema_ts_ticket_weekly %>% 
  decompose() %>% 
  autoplot()

cinema_ts_ticket_weekly %>% 
  stl(s.window = 7) %>% 
  autoplot()

cinema_ts_ticket_weekly %>% 
  decompose(type = "multiplicative") %>% 
  autoplot()

Comparing decompose() and stl(): apart from the obvious cut-off at both ends of the decompose() trend, caused by its moving-average approach, stl() also produces a noticeably smoother trend line. This indicates that the seasonality is extracted more precisely, which helps us understand the data and will be useful during model fitting.
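
To see this end-of-trend behaviour concretely, we can peek at the trend components themselves: decompose() leaves NA values at both ends of its moving-average trend, while the Loess-based trend from stl() is defined for every observation.

head(decompose(cinema_ts_ticket_weekly)$trend, 7) # moving-average trend: the first values are NA
head(stl(cinema_ts_ticket_weekly, s.window = 7)$time.series[, "trend"], 7) # Loess trend: no NAs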

Comparing additive (the default) and multiplicative decomposition, I’m still not sure which one to use, since both look quite similar.

Biweekly

cinema_ts_ticket_biweekly <- ts(data = cinema_clean_ticketuse$ticket_use,
                start = min(cinema_clean_ticketuse$date),
                frequency = 14)

cinema_ts_ticket_biweekly %>% 
  decompose() %>% 
  autoplot()

cinema_ts_ticket_biweekly %>% 
  decompose(type = "multiplicative") %>% 
  autoplot()

cinema_ts_ticket_biweekly %>% 
  stl(s.window = "periodic") %>% 
  autoplot()

cinema_ts_ticket_biweekly %>% 
  stl(s.window = 14) %>% 
  autoplot()

Compared to the weekly decomposition, the trend line is smoother and has fewer bumps, which indicates that the seasonal patterns are captured rather well.

Also, once again, I am not sure whether to use the additive or the multiplicative type later in the model-fitting stage: the range of the data does seem to change (which would suggest a multiplicative model), but it fluctuates, with the seasonal maxima and minima shrinking in one period and growing in another.

Monthly

cinema_ts_ticket_monthly <- ts(data = cinema_clean_ticketuse$ticket_use,
                start = min(cinema_clean_ticketuse$date),
                frequency = 30)

cinema_ts_ticket_monthly %>% 
  decompose() %>% 
  autoplot()

cinema_ts_ticket_monthly %>% 
  stl(s.window = "periodic") %>% 
  autoplot()

cinema_ts_ticket_monthly %>% 
  stl(s.window = 30) %>% 
  autoplot()

Comparatively, the trend line here is the smoothest, but in exchange the residuals fluctuate quite wildly.

Observing Sum of Total Sales

cinema_clean_totalsales <- 
  cinema %>% 
  mutate(date = ymd(date)) %>% 
  group_by(date) %>% 
  summarise(total_sales = sum(total_sales)) %>% 
  arrange(date) %>% 
  pad() %>% 
  fill_by_value(value = 0)

head(cinema_clean_totalsales,30)
#> # A tibble: 30 x 2
#>    date       total_sales
#>    <date>           <dbl>
#>  1 2018-02-21    32030000
#>  2 2018-02-22           0
#>  3 2018-02-23      180000
#>  4 2018-02-24           0
#>  5 2018-02-25           0
#>  6 2018-02-26           0
#>  7 2018-02-27           0
#>  8 2018-02-28           0
#>  9 2018-03-01           0
#> 10 2018-03-02           0
#> # ... with 20 more rows

Similarly, the first 21 observations are mostly 0, so we also remove the first 21 time-series observations of Total Sales.

cinema_clean_totalsales <- 
  cinema %>% 
  mutate(date = ymd(date)) %>% 
  group_by(date) %>% 
  summarise(total_sales = sum(total_sales)) %>% 
  arrange(date) %>% 
  pad() %>% 
  fill_by_value(value = 0) %>% 
  tail(-21)

Then let’s view the data as a whole.

ts(data = cinema_clean_totalsales$total_sales,
   start = min(cinema_clean_totalsales$date),
   frequency = 365) %>% 
  autoplot()

Weekly

cinema_ts_sales_weekly <- ts(data = cinema_clean_totalsales$total_sales,
                start = min(cinema_clean_totalsales$date),
                frequency = 7)

cinema_ts_sales_weekly %>% 
  decompose() %>% 
  autoplot()

cinema_ts_sales_weekly %>% 
  decompose(type = "multiplicative") %>% 
  autoplot()

Biweekly

cinema_ts_sales_biweekly <- ts(data = cinema_clean_totalsales$total_sales,
                start = min(cinema_clean_totalsales$date),
                frequency = 14)

cinema_ts_sales_biweekly %>% 
  decompose() %>% 
  autoplot()

cinema_ts_sales_biweekly %>% 
  decompose(type = "multiplicative") %>% 
  autoplot()

cinema_ts_sales_biweekly %>% 
  stl(s.window = "periodic") %>% 
  autoplot()

cinema_ts_sales_biweekly %>% 
  stl(s.window = 14) %>% 
  autoplot()

Monthly

cinema_ts_sales_monthly <- ts(data = cinema_clean_totalsales$total_sales,
                start = min(cinema_clean_totalsales$date),
                frequency = 30)

cinema_ts_sales_monthly %>% 
  decompose() %>% 
  autoplot()

cinema_ts_sales_monthly %>% 
  stl(s.window = "periodic") %>% 
  autoplot()

cinema_ts_sales_monthly %>% 
  stl(s.window = 7) %>% 
  autoplot()

cinema_ts_sales_monthly %>% 
  stl(s.window = 14, robust = T) %>% 
  autoplot()

Overall, the observations are similar to those for Ticket Use.

Admittedly, the plots do not make it clear whether an additive or a multiplicative approach suits the Ticket Use data, but intuitively, since each cinema’s Total Sales should equal the number of tickets sold multiplied by the ticket price, I would expect Total Sales to be better served by a multiplicative approach.
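
As a quick sanity check of that intuition (assuming the raw columns relate exactly this way, which the dataset documentation does not spell out), we can compare total_sales against tickets_sold * ticket_price row by row:

cinema %>% 
  mutate(implied_sales = tickets_sold * ticket_price) %>% # hypothetical helper column for this check only
  summarise(share_matching = mean(near(total_sales, implied_sales))) # share of rows where the two agree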

Cross-Validation

range(cinema$date)
#> [1] "2018-02-21" "2018-11-04"

Since we have around 8+ months of daily data, I think setting aside the last month as testing data, and using the rest for training, should be plenty.

cinema_ticket_weekly_test <- tail(cinema_ts_ticket_weekly, 30)
cinema_ticket_weekly_train <- head(cinema_ts_ticket_weekly, -30)

cinema_ticket_biweekly_test <- tail(cinema_ts_ticket_biweekly, 30)
cinema_ticket_biweekly_train <- head(cinema_ts_ticket_biweekly, -30)

cinema_ticket_monthly_test <- tail(cinema_ts_ticket_monthly, 30)
cinema_ticket_monthly_train <- head(cinema_ts_ticket_monthly, -30)

cinema_sales_weekly_test <- tail(cinema_ts_sales_weekly, 30)
cinema_sales_weekly_train <- head(cinema_ts_sales_weekly, -30)

cinema_sales_biweekly_test <- tail(cinema_ts_sales_biweekly, 30)
cinema_sales_biweekly_train <- head(cinema_ts_sales_biweekly, -30)

cinema_sales_monthly_test <- tail(cinema_ts_sales_monthly, 30)
cinema_sales_monthly_train <- head(cinema_ts_sales_monthly, -30)

Model Fitting and Evaluation

Since the decompositions clearly show both a trend and a seasonal component in our data, we will use Holt-Winters Exponential Smoothing (Triple Exponential Smoothing); the Simple and Double versions of exponential smoothing only apply when the trend and/or seasonal component is absent.

Other than that, we will also try Seasonal ARIMA (SARIMA) and Auto-ARIMA for comparison. STLM (Seasonal Trend with Loess Model) was not taught formally during the class and was only included as additional information, so it will be interesting to see how it fares against the models most commonly used for time-series analysis.

Holt-Winters Exponential Smoothing

From the EDA and the overall plot of our data, it was not clear whether the seasonality is additive or multiplicative: the seasonal minima and maxima do change, but the amount of change can suddenly become larger or smaller. Besides, during my testing while preparing this report, I found that the multiplicative seasonal model throws an error whenever the series contains 0 values, and, if you recall, we padded the missing dates with 0.

Therefore we will move forward with the default, additive seasonality for now.
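
For the record, the error mentioned above can be demonstrated without stopping the knit by wrapping the call in try(); this is only an illustration of the behaviour I ran into, not a model we will use:

# Wrapped in try() so the document still knits; with 0 values in the series,
# the multiplicative seasonal fit fails as described above.
try(HoltWinters(x = cinema_ticket_weekly_train, seasonal = "multiplicative"))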

Ticket Use in Weekly Seasonality

Extending the training data with a 30-day forecast of total tickets bought and used:

cinema_ticket_w <- HoltWinters(x = cinema_ticket_weekly_train, seasonal = "additive")
fc_cinema_ticket_w <- forecast(object = cinema_ticket_w, h = 30)

cinema_ticket_weekly_train %>% 
  autoplot() +
  autolayer(fc_cinema_ticket_w, series = "forecast")

Visually comparing the actual testing data with the forecasted values:

cinema_ticket_weekly_test %>% 
  autoplot()+
  autolayer(fc_cinema_ticket_w$mean, series = "forecast")

Although the forecast is rather simple, its shape is actually quite similar to the actual data.

Calculating the Mean Absolute Percentage Error (MAPE) to evaluate our forecasting accuracy:

MAPE(y_pred = fc_cinema_ticket_w$mean, y_true = cinema_ticket_weekly_test)
#> [1] 100.6743

100.67%, quite a high error. Let’s see whether we can reduce it with other seasonalities or other methods.

Ticket Use in Bi-weekly Seasonality

Extending the training data with a 30-day forecast of total tickets bought and used:

cinema_ticket_bw <- HoltWinters(x = cinema_ticket_biweekly_train, seasonal = "additive")
fc_cinema_ticket_bw <- forecast(object = cinema_ticket_bw, h = 30)

cinema_ticket_biweekly_train %>% 
  autoplot() +
  autolayer(fc_cinema_ticket_bw, series = "forecast")

Visually comparing the actual testing data with the forecasted values:

cinema_ticket_biweekly_test %>% 
  autoplot()+
  autolayer(fc_cinema_ticket_bw$mean, series = "forecast")

The forecast looks quite good at the beginning, but unfortunately it worsens towards the end.

Calculating the Mean Absolute Percentage Error (MAPE) to evaluate our forecasting accuracy:

MAPE(y_pred = fc_cinema_ticket_bw$mean, y_true = cinema_ticket_biweekly_test)
#> [1] 81.32678

81.33%, still a very high error, but a bit better than the weekly seasonality, so we will keep this model for now.

Ticket Use in Monthly Seasonality

Extending the training data with a 30-day forecast of total tickets bought and used:

cinema_ticket_m <- HoltWinters(x = cinema_ticket_monthly_train, seasonal = "additive")
fc_cinema_ticket_m <- forecast(object = cinema_ticket_m, h = 30)

cinema_ticket_monthly_train %>% 
  autoplot() +
  autolayer(fc_cinema_ticket_m, series = "forecast")

Visually comparing the actual testing data with the forecasted values:

cinema_ticket_monthly_test %>% 
  autoplot()+
  autolayer(fc_cinema_ticket_m$mean, series = "forecast")

The seasonality is not really captured by this model, as the actual and forecasted lines look very different. Then again, the forecast stays roughly in the middle of the actual values. Let’s see how the MAPE turns out.

MAPE(y_pred = fc_cinema_ticket_m$mean, y_true = cinema_ticket_monthly_test)
#> [1] 24.1597

24.16%, surprisingly the lowest error compared to the others so far.

Total Sales in Weekly Seasonality

Extending the training data with a 30-day forecast of total sales:

cinema_sales_w <- HoltWinters(x = cinema_sales_weekly_train, seasonal = "additive")
fc_cinema_sales_w <- forecast(object = cinema_sales_w, h = 30)

cinema_sales_weekly_train %>% 
  autoplot() +
  autolayer(fc_cinema_sales_w, series = "forecast")

Visually comparing the actual testing data with the forecasted values:

cinema_sales_weekly_test %>% 
  autoplot()+
  autolayer(fc_cinema_sales_w$mean, series = "forecast")

Although the forecast is rather simple, its shape is actually quite similar to the actual data.

Calculating the Mean Absolute Percentage Error (MAPE) to evaluate our forecasting accuracy:

MAPE(y_pred = fc_cinema_sales_w$mean, y_true = cinema_sales_weekly_test)
#> [1] 308.5843

308.58%, a very high error.

Total Sales in Bi-weekly Seasonality

Extending the training data with a 30-day forecast of total sales:

cinema_sales_bw <- HoltWinters(x = cinema_sales_biweekly_train, seasonal = "additive")
fc_cinema_sales_bw <- forecast(object = cinema_sales_bw, h = 30)

cinema_sales_biweekly_train %>% 
  autoplot() +
  autolayer(fc_cinema_sales_bw, series = "forecast")

Visually comparing the actual testing data with the forecasted values:

cinema_sales_biweekly_test %>% 
  autoplot()+
  autolayer(fc_cinema_sales_bw$mean, series = "forecast")

The forecast looks quite good at the beginning, but unfortunately it worsens towards the end.

Calculating the Mean Absolute Percentage Error (MAPE) to evaluate our forecasting accuracy:

MAPE(y_pred = fc_cinema_sales_bw$mean, y_true = cinema_sales_biweekly_test)
#> [1] 184.2316

184.23%, still a very high error rate, so we will disregard this model for now.

Total Sales in Monthly Seasonality

Extending the training data with a 30-day forecast of total sales:

cinema_sales_m <- HoltWinters(x = cinema_sales_monthly_train, seasonal = "additive")
fc_cinema_sales_m <- forecast(object = cinema_sales_m, h = 30)

cinema_sales_monthly_train %>% 
  autoplot() +
  autolayer(fc_cinema_sales_m, series = "forecast")

Visually comparing the actual testing data with the forecasted values:

cinema_sales_monthly_test %>% 
  autoplot()+
  autolayer(fc_cinema_sales_m$mean, series = "forecast")

The seasonality is not really captured by this model, as the actual and forecasted lines look very different. Then again, the forecast stays roughly in the middle of the actual values. Let’s see how the MAPE turns out.

MAPE(y_pred = fc_cinema_sales_m$mean, y_true = cinema_sales_monthly_test)
#> [1] 74.4517

74.45%, the best error rate among the Total Sales models, but still worse than the models using Ticket Use.

Seasonal ARIMA

Seasonal ARIMA (Autoregressive Integrated Moving Average) should be a better fit for time-series objects with seasonal patterns. To use it, we need to find good values for the overall (non-seasonal) parameters (p, d, q) and the seasonal parameters (P, D, Q).

D and d

Looking again at the overall data:

ts(data = cinema_clean_ticketuse$ticket_use,
   start = min(cinema_clean_ticketuse$date),
   frequency = 365) %>% 
  autoplot()

We could try some differencing to make our data more stationary, to be more suitable for an ARIMA model.

First, I’m going to try a weekly seasonal differencing on the training data.

cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -5.775, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  autoplot()

Since the p-value (0.01) < the default alpha (0.05), we can reject the null hypothesis and say that the data is now reasonably stationary.
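
As an optional complementary check, the tseries package (already loaded) also provides kpss.test(), whose null hypothesis is stationarity, the opposite of the ADF null, so the two tests together give a fuller picture:

# KPSS test: the null hypothesis here is (level) stationarity, so a small p-value
# would disagree with the ADF conclusion above.
cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  kpss.test()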

But looking at the plot, the series does not seem all that stationary, in that it does not hover closely around the x-axis. We should try another round of weekly seasonal differencing.

cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -6.592, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  autoplot()

The p-value is still < alpha, so the series remains stationary by this test. Let’s try another round of seasonal differencing.

cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -6.7838, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  autoplot()

Still looks good. We will stop after trying one more time.

cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -7.0698, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  autoplot()

The plot looks somewhat better, and the p-value stays at 0.01. So for now we will say D = 4, since we applied weekly seasonal differencing four times before deeming the data sufficiently stationary.

Then let’s look for the overall differencing that we need.

cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 1) %>% 
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -4.358, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 1) %>% 
  autoplot()

Looks good already, and the p-value still falls under 0.05. Try once again:

cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -5.5842, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  autoplot()

Looks better, with an even more negative test statistic, and the p-value stays below alpha. We will try up to two more rounds of differencing; applying a very large number of differences is practically unheard of, so we will limit ourselves to a maximum of four.

cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -13.639, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  autoplot()

cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -22.371, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  autoplot()

Since we were also able to apply four rounds of overall differencing, d = 4.
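
As a cross-check on these manually chosen orders, the forecast package can also estimate the amount of differencing directly: ndiffs() suggests d from unit-root tests and nsdiffs() suggests D from a measure of seasonal strength, and they often recommend smaller values than the four rounds applied above.

ndiffs(cinema_sales_weekly_train)  # suggested number of regular differences
nsdiffs(cinema_sales_weekly_train) # suggested number of seasonal differences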

P, p, Q, and q

cinema_sales_weekly_train %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 7) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  diff(lag = 1) %>% 
  tsdisplay(lag.max = 35)

We look at the PACF plot to deduce the P and p values.
Since we are considering weekly seasonality, for P we look at lags that are multiples of 7: the values at lags 14, 28, and 35 all stay within the blue significance bounds, therefore we will use P = 0.

Meanwhile, for the overall (non-seasonal) part, lags 1 through 5 cross well above the blue line, so we take the longest consecutive run of significant lags starting from 1, which gives p = 5.

Similarly, let’s look at the ACF to fill in our Q and q values.
Lags 7 and 14 both cross the blue line, although one is positive and the other negative, so we will try both: our candidates are Q = 1 or 2.

Meanwhile, up to the 19th lag, most values seem to cross the blue line. Since very high orders can cause errors in the fitting algorithm, I will try q = 4 or 5.

Parameter Candidates Summary

Our Seasonal ARIMA follows the pattern SARIMA(p,d,q)(P,D,Q)[seasonal]. Our candidates so far:
  - p = 5
  - d = 4 (although 0-4 are all technically candidates, that would be too many possibilities to try)
  - q = 4 or 5
  - P = 0
  - D = 4 (same reasoning as for d)
  - Q = 1 or 2
  - seasonal period = 7 (since we are considering weekly seasonality for now)

Model Fitting & Pre-evaluation

Considering these candidates, we have four possible SARIMA models to try. Let’s fit all of them and see which ones perform best with the least error.

Note: using D = 4 (and even D = 3) for the seasonal differencing caused the error “initial value in ‘vmmin’ is not finite”, but 2 and below works fine with the arima() function, so I will use D = 2 (while keeping d = 4).
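
For readability, the four fits are written out explicitly below; the same grid could also be generated programmatically, as in this sketch (param_grid and fits are hypothetical helpers, not part of the original workflow):

# Sketch: enumerate the q / Q candidates with p = 5, d = 4, P = 0, D = 2 held fixed.
param_grid <- expand.grid(q = c(4, 5), Q = c(1, 2))
fits <- apply(param_grid, 1, function(pars) {
  arima(x = cinema_ticket_weekly_train,
        order = c(5, 4, pars[["q"]]),
        seasonal = c(0, 2, pars[["Q"]]))
})
sapply(fits, AIC) # compare the four AIC scores at a glance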

ticket_w_arima_544021 <- arima(x = cinema_ticket_weekly_train, order = c(5,4,4), seasonal = c(0,2,1))
ticket_w_arima_544022 <- arima(x = cinema_ticket_weekly_train, order = c(5,4,4), seasonal = c(0,2,2))
ticket_w_arima_545021 <- arima(x = cinema_ticket_weekly_train, order = c(5,4,5), seasonal = c(0,2,1))
ticket_w_arima_545022 <- arima(x = cinema_ticket_weekly_train, order = c(5,4,5), seasonal = c(0,2,2))
ticket_w_arima_544021$aic
#> [1] 4571.459
ticket_w_arima_544022$aic
#> [1] 4572.606
ticket_w_arima_545021$aic
#> [1] 4559.675
ticket_w_arima_545022$aic
#> [1] 4571.829

The AIC scores are all pretty similar.
Since AIC measures information loss, we pick the smallest one. For now, ticket_w_arima_545021 and ticket_w_arima_544021 are the top two performers out of these four models.

accuracy(ticket_w_arima_544021)
#>                    ME     RMSE      MAE MPE MAPE      MASE       ACF1
#> Training set 1888.562 37794.68 27204.19 NaN  Inf 0.5427334 0.03586424
accuracy(ticket_w_arima_544022)
#>                    ME     RMSE      MAE MPE MAPE     MASE       ACF1
#> Training set 1915.689 37744.62 26872.15 NaN  Inf 0.536109 0.04238482
accuracy(ticket_w_arima_545021)
#>                    ME     RMSE      MAE MPE MAPE      MASE       ACF1
#> Training set 1209.311 35369.21 24737.61 NaN  Inf 0.4935242 -0.1161906
accuracy(ticket_w_arima_545022)
#>                    ME     RMSE      MAE MPE MAPE      MASE        ACF1
#> Training set 831.4483 38035.18 27036.66 NaN  Inf 0.5393911 -0.08674314

Ideally we would compare MAPE, but it comes out as Inf, most likely because the training data contains zero values (MAPE divides by the actual values). Looking instead at the MAE (Mean Absolute Error) and MASE (Mean Absolute Scaled Error), which should both be as small as possible, they support the top performer by AIC: ticket_w_arima_545021.

Moving forward, we will only use this top model to evaluate against other models.

Forecasting

fc_ticket_w_arima_545021 <- forecast(ticket_w_arima_545021, h = 30)

Model Evaluation

Visually comparing the actual testing data with the forecasted values:

cinema_ticket_weekly_test %>% 
  autoplot()+
  autolayer(fc_ticket_w_arima_545021$mean, series = "SARIMA(5,4,5)(0,2,1)[7]")

The seasonality is captured rather well by this model; the forecast is quite close for the first few observations, although the trend drifts further off later on.

MAPE(y_pred = fc_ticket_w_arima_545021$mean, y_true = cinema_ticket_weekly_test)
#> [1] 6.612768

6.61% error! This is our best model so far.

Auto-ARIMA

Similar to SARIMA, but now we let the algorithm pick the best possible parameters for the ARIMA. It has its limitations, but it makes a good baseline when we risk overthinking our parameter selection.

Model Fitting & Pre-evaluation

This should be quite simple:

ticket_w_auto_arima <- auto.arima(cinema_ticket_weekly_train, seasonal = T)
summary(ticket_w_auto_arima)
#> Series: cinema_ticket_weekly_train 
#> ARIMA(1,0,2)(2,1,0)[7] 
#> 
#> Coefficients:
#>          ar1      ma1     ma2     sar1     sar2
#>       0.8647  -0.4681  0.0389  -0.2097  -0.0368
#> s.e.  0.0568   0.0939  0.0832   0.0802   0.0851
#> 
#> sigma^2 estimated as 959708693:  log likelihood=-2338.12
#> AIC=4688.24   AICc=4688.68   BIC=4708
#> 
#> Training set error measures:
#>                    ME     RMSE      MAE MPE MAPE      MASE        ACF1
#> Training set 450.6323 30063.32 19014.96 NaN  Inf 0.7541273 0.003883474

As we can see above, auto.arima() picks ARIMA(1,0,2)(2,1,0)[7] as its best specification. Let’s see how it compares.

Its AIC score of 4688.24 is not too different from the AIC of our manually parameterized SARIMA (4559.675), just a bit higher.

Forecasting

fc_ticket_w_auto_arima <- forecast(ticket_w_auto_arima, h = 30)

Model Evaluation

Visually comparing the actual testing data with the forecasted values:

cinema_ticket_weekly_test %>% 
  autoplot()+
  autolayer(fc_ticket_w_auto_arima$mean, series = "Auto ARIMA")

The seasonality is captured rather well by this model, and the trend seems much better.

MAPE(y_pred = fc_ticket_w_auto_arima$mean, y_true = cinema_ticket_weekly_test)
#> [1] 69.98608

Surprisingly, it produces a 69.99% error, much higher than our SARIMA with manually chosen parameters, so it is not as good.

Seasonal Trend with Loess Model (STLM)

While researching this case, some commenters noted that ts() and decompose() do not work well for a dataset like ours: daily observations analyzed on a weekly (or even monthly) basis. They recommend the zoo library, or stl(), to decompose the series using Loess.

This was mentioned only briefly during class; what I remember is that its main benefit is a smoother decomposition, rather than the simple moving-average method that leaves empty values at the beginning and end of the trend component.

The stl() decomposition was already done in the Data Wrangling & EDA section, where we saw the smoother trend line it extracts compared to decompose(). We will now use the stlm() function to automatically find the best ARIMA on top of the Loess decomposition.
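
Conceptually, stlm(method = "arima") decomposes the series with stl(), fits an ARIMA (via auto.arima()) to the seasonally adjusted series, and re-adds the seasonal component when forecasting. A rough manual equivalent, just to build intuition (not used for the results below):

decomp   <- stl(cinema_ticket_weekly_train, s.window = 7)
seas_adj <- seasadj(decomp)      # training series with the seasonal component removed
fit_adj  <- auto.arima(seas_adj) # model the seasonally adjusted series
# forecast(fit_adj, h = 30) plus the re-added seasonal component approximates forecast() on the stlm fit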

Weekly Seasonality

Model Fitting & Pre-evaluation

ticket_w_stlm_arima <- 
  cinema_ticket_weekly_train %>% 
  stlm(s.window = 7, method = "arima")

ticket_w_stlm_arima$model$aic
#> [1] 4673.066

The AIC score of 4673.066 is similar to the best-performing models so far.

Forecasting

fc_ticket_w_stlm_arima <- forecast(object = ticket_w_stlm_arima, h = 30)

Model Evaluation

cinema_ticket_weekly_test %>% 
  autoplot()+
  autolayer(fc_ticket_w_stlm_arima$mean, series = "STLM ARIMA")

The seasonality is quite well caught, and the trend seems to be quite good.

MAPE(y_pred = fc_ticket_w_stlm_arima$mean, y_true = cinema_ticket_weekly_test)
#> [1] 63.17171

A rather high 63.17% error rate, considering we already have a model with around 6% MAPE.

Bi-weekly Seasonality

Model Fitting & Pre-evaluation

ticket_bw_stlm_arima <- 
  cinema_ticket_biweekly_train %>% 
  stlm(s.window = 14, method = "arima")

ticket_bw_stlm_arima$model$aic
#> [1] 4846.76

The AIC score of 4846.76 is quite similar to the best-performing models from before.

Forecasting

fc_ticket_bw_stlm_arima <- forecast(object = ticket_bw_stlm_arima, h = 30)

Model Evaluation

cinema_ticket_biweekly_test %>% 
  autoplot()+
  autolayer(fc_ticket_bw_stlm_arima$mean, series = "STLM ARIMA")

The seasonality is quite well caught, and the trend seems to be quite good.

MAPE(y_pred = fc_ticket_bw_stlm_arima$mean, y_true = cinema_ticket_biweekly_test)
#> [1] 79.37106

A rather high 79.37% error rate, especially considering that the weekly seasonality gives a lower one.

Monthly Seasonality

Model Fitting & Pre-evaluation

ticket_m_stlm_arima <- 
  cinema_ticket_monthly_train %>% 
  stlm(s.window = 30, method = "arima")

ticket_m_stlm_arima$model$aic
#> [1] 4964.553

The AIC score of 4964.553 is quite similar to the best-performing models from before.

Forecasting

fc_ticket_m_stlm_arima <- forecast(object = ticket_m_stlm_arima, h = 30)

Model Evaluation

cinema_ticket_monthly_test %>% 
  autoplot()+
  autolayer(fc_ticket_m_stlm_arima$mean, series = "STLM ARIMA")

The seasonality is quite well caught, and the trend seems to be quite good.

MAPE(y_pred = fc_ticket_m_stlm_arima$mean, y_true = cinema_ticket_monthly_test)

The monthly STLM’s error rate is still nowhere near the roughly 6% MAPE of our best SARIMA model, so we will not keep this model either.

Assumptions of Time-Series

Best forecasting methods are assumed to have residuals (error) with these characteristics:
  1. No autocorrelation between the residuals, or simply no correlation between residuals and their own lagging/future data points.
  2. Residuals are normally distributed, ideally with a mean close to 0.

No-autocorrelation between Residuals

# using Ljung-Box test
Box.test(x = ticket_w_arima_545021$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  ticket_w_arima_545021$residuals
#> X-squared = 2.8217, df = 1, p-value = 0.09299

We want to fail to reject the null hypothesis that the residuals have no autocorrelation, so we need the p-value to be higher than alpha. Using the default alpha of 0.05 (5% error tolerance), our p-value of 0.09299 is > 0.05, therefore our best Seasonal ARIMA model (ticket_w_arima_545021) passes the assumption of no autocorrelation between residuals.
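
The forecast package also bundles these diagnostics into a single call, checkresiduals(), which runs a Ljung-Box test (by default over more lags than the single-lag test above, so its p-value may differ) and plots the residuals, their ACF, and a histogram:

checkresiduals(ticket_w_arima_545021) # Ljung-Box test plus residual, ACF, and histogram plots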

Residuals Normality

shapiro.test(ticket_w_arima_545021$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  ticket_w_arima_545021$residuals
#> W = 0.96456, p-value = 0.00004874

Similarly to the no-autocorrelation assumption, we want the null hypothesis that the residuals are normally distributed to hold. Since our p-value (0.00004874) < alpha (0.05), we reject the null hypothesis: our model’s residuals are not normally distributed, so our best model does not pass this assumption.
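
A quick visual check can complement the test by showing how the residuals deviate from normality:

# Q-Q plot of the residuals: systematic departures from the reference line
# indicate non-normality (e.g. heavy tails or skew).
qqnorm(residuals(ticket_w_arima_545021))
qqline(residuals(ticket_w_arima_545021))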

Conclusions

  1. Our best model is ticket_w_arima_545021, a Seasonal ARIMA fitted on the Ticket Use data with a weekly seasonal pattern. It has a 6.61% MAPE (Mean Absolute Percentage Error), the lowest out of all the methods we tried, including STLM and Auto-ARIMA.
  2. Unfortunately, our model does not pass the normality assumption, so either this dataset is not ideally suited to this kind of time-series model, or further model tuning is needed to improve it.

About Me

Hi! My name is Calvin, and I am from Jakarta, Indonesia. I am looking to become a full-time data analyst and/or data scientist. I have a background in Mathematics and Computer Science from my Bachelor’s degrees, and I love playing with numbers and data. I made this report to enhance my Data Science portfolio (constructive criticism is very much welcome!) and as part of the Learn-By-Building assignment at Algoritma Data Science School.

You can reach me at my LinkedIn for more discussion. Thank you!