Introduction

In Assignment 2, our group set out to provide actionable insights that StudioX (a new entrant into the Hollywood film industry) could use to quantifiably drive movie sales. Based on that analysis, we drew the following insights:

In this paper, I will go into more detail and examine how box office sales can be forecast using time series analysis. As StudioX is a new entrant to the Hollywood film industry, it is important that it can forecast sales accurately in order to minimise risk and make sound decisions.

Using the data from Assignment 2, I will attempt to forecast the daily box office sales of the movies that yield the highest Return On Investment (ROI) and see whether there are any patterns across different market segments. For example, I will forecast the box office sales for a small-budget movie and compare the result with the forecast for blockbuster movies, which aim for a bigger audience. This will give StudioX another tool for choosing a forecasting model based on the strategy it would like to take in the future.

Datasets

Using the data from Assignment 2, we first identify the top 5 movies with the highest Return On Investment (ROI) from 2015 to 2018. We define ROI as:

(Gross Box Office Sales - Budget) / Budget * 100

We separate the movies into two groups:

  • Group 1 - Movies that have a budget of less than $100m
  • Group 2 - Movies that have a budget equal to or over $100m (blockbuster)
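
The ROI definition and the budget split above can be sketched in a few lines of R. This is only an illustration: the function name roi_pct and the data frame layout are my own assumptions, and the three rows are copied from the tables in this post rather than loaded from the Assignment 2 dataset.

```r
# ROI as a percentage of budget, following the definition above.
# roi_pct is a hypothetical helper name, not part of any package.
roi_pct <- function(gross, budget) {
  (gross - budget) / budget * 100
}

# Three rows copied from the tables in this post (illustrative subset).
movies <- data.frame(
  title  = c("The Gallows", "Get Out", "Jurassic World"),
  gross  = c(25578900, 182376000, 730422400),
  budget = c(100000, 4500000, 150000000)
)

# Rounded ROI and the budget-based group label.
movies$roi   <- round(roi_pct(movies$gross, movies$budget))
movies$group <- ifelse(movies$budget < 100e6, "Group 1", "Group 2")

# Rank by ROI, highest first.
movies <- movies[order(-movies$roi), ]
print(movies)
```

Ranking within each group then only needs the same ordering step applied to the rows of that group.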

Here are the top 5 movies that yield the highest ROI for Group 1:

Title          | Genres                    | Total Gross ($) | Total Budget ($) | ROI (%)
The Gallows    | Horror, Thriller          |        25578900 |           100000 |   25479
Get Out        | Horror, Mystery, Thriller |       182376000 |          4500000 |    3953
Unfriended     | Drama, Horror, Mystery    |        34972000 |          1000000 |    3397
God’s Not Dead | Drama                     |        69703700 |          2000000 |    3385
War Room       | Drama                     |        72208000 |          3000000 |    2307

Here are the top 5 movies that yield the highest ROI for Group 2:

Title                                 | Genres                       | Total Gross ($) | Total Budget ($) | ROI (%)
Jurassic World                        | Action, Adventure, Sci-Fi    |       730422400 |        150000000 |     387
Star Wars: The Force Awakens          | Action, Adventure, Fantasy   |      1009345800 |        245000000 |     312
Beauty and the Beast (2017)           | Fantasy                      |       523204000 |        160000000 |     227
The Hunger Games: Mockingjay - Part 1 | Action, Adventure, Sci-Fi    |       384883000 |        125000000 |     208
Frozen                                | Animation, Adventure, Comedy |       459757600 |        150000000 |     207

Comparing the two groups, Group 1 has a much higher ROI than Group 2.

I will pick the top 2 movies from each group, extract their daily box office sales data from the Box Office Mojo website (IMDb 2018A, IMDb 2018B, IMDb 2018C and IMDb 2018D), fit an ARIMA model to each one and compare the results.

Methodology

Here is the high-level approach I am going to use:

  1. Data exploration - plot and examine the time series
  2. Decompose the time series data
  3. Check for stationarity and apply differencing if needed
  4. Examine the autocorrelation function (ACF) and partial autocorrelation function (PACF)
  5. Modelling using ARIMA
  6. Perform forecasting
  7. Evaluate the modelling accuracy
  8. Discuss the implications of the results for the research question

Data Exploration

Here are the time series plots of the top 2 ROI movies in each group:

Group 1 - Low Budget

Group 2 - Big Budget

Both groups show a similar downward trend: every movie generates fewer sales the longer it has been showing in theatres. Hence none of the time series is stationary.

Decompose the data

We will use STL to decompose the time series data. STL is an acronym for “Seasonal and Trend decomposition using Loess”, where Loess is a method for estimating nonlinear relationships (Hyndman & Athanasopoulos 2018, chapter 6.6, para. 1).

Group 1 - Low Budget

Group 2 - Big Budget

  • The graphs above indicate that the trend dominates the variation in the data series, and they confirm the trend of decreasing box office sales. The graphs also indicate that two of the movies (Get Out and Star Wars) show variation related to seasonality, which is probably related to the increase in sales over the weekend.
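
The decomposition described above can be sketched with the stl function from base R. The series below is synthetic (an exponential decay with a weekend bump), so the numbers are purely illustrative; the weekly frequency of 7 and the log transform are my own assumptions and not necessarily the settings used for the plots in this post.

```r
# Synthetic daily sales: exponential decay times a weekend bump (illustrative only).
set.seed(1)
days <- 1:84
weekend_bump <- ifelse((days %% 7) %in% c(5, 6, 0), 1.5, 1.0)
sales <- 1e6 * exp(-0.04 * days) * weekend_bump *
  exp(rnorm(length(days), sd = 0.05))

# frequency = 7 declares a weekly cycle so stl() can estimate a seasonal component.
sales_ts <- ts(sales, frequency = 7)

# Decompose log sales so that the multiplicative weekend effect becomes additive.
fit <- stl(log(sales_ts), s.window = "periodic")

# Matrix with one column per component: seasonal, trend, remainder.
components <- fit$time.series
head(components)
```

Calling plot(fit) draws the data, seasonal, trend and remainder panels, which is the kind of figure discussed above.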

Stationarity and Differencing

The graphs above indicate that the sales are not stationary, and hence we need to apply differencing to stabilise the data.
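
As a rough sketch of this step, the snippet below takes a first difference and a log return of a synthetic decaying series and runs a unit root test on each. The data are made up, and the choice of the Phillips-Perron test (PP.test in base R) is my own assumption for the sake of a self-contained example; other unit root tests could equally be used.

```r
# Synthetic decaying daily sales with multiplicative noise (illustrative only).
set.seed(42)
days  <- 1:60
sales <- 2e6 * exp(-0.05 * days) * exp(rnorm(length(days), sd = 0.1))

# First difference: day-over-day change in sales.
d1 <- diff(sales)

# Log return: difference of log sales, roughly the daily percentage change.
log_ret <- diff(log(sales))

# Phillips-Perron unit root test (null hypothesis: the series has a unit root).
pp_raw <- PP.test(sales)
pp_ret <- PP.test(log_ret)

# A small p-value is evidence against a unit root.
c(raw = pp_raw$p.value, log_return = pp_ret$p.value)
```

Each differencing step shortens the series by one observation, which matters when the series is as short as a single movie's theatrical run.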

Differencing

Log Return

Unit Root Test

Model and evaluation

Split up the training / test data

We use the first 60% of the observations as a training set to fit the models, and the remaining 40% as test data for forecasting.

Movie_Gallow_Daily_ts.train <- head(Movie_Gallow_Daily_ma, round(length(Movie_Gallow_Daily_ma) * 0.6))
Movie_Gallow_Daily_ts.h <- length(Movie_Gallow_Daily_ma) - length(Movie_Gallow_Daily_ts.train)
Movie_Gallow_Daily_ts.test <- tail(Movie_Gallow_Daily_ma, Movie_Gallow_Daily_ts.h)

Movie_GetOut_Daily_ts.train <- head(Movie_GetOut_Daily_ma, round(length(Movie_GetOut_Daily_ma) * 0.6))
Movie_GetOut_Daily_ts.h <- length(Movie_GetOut_Daily_ma) - length(Movie_GetOut_Daily_ts.train)
Movie_GetOut_Daily_ts.test <- tail(Movie_GetOut_Daily_ma, Movie_GetOut_Daily_ts.h)

Movie_JurassicWorld_Daily_ts.train <- head(Movie_JurassicWorld_Daily_ma, round(length(Movie_JurassicWorld_Daily_ma) * 0.6))
Movie_JurassicWorld_Daily_ts.h <- length(Movie_JurassicWorld_Daily_ma) - length(Movie_JurassicWorld_Daily_ts.train)
Movie_JurassicWorld_Daily_ts.test <- tail(Movie_JurassicWorld_Daily_ma, Movie_JurassicWorld_Daily_ts.h)

Movie_StarWarsTFW_Daily_ts.train <- head(Movie_StarWarsTFW_Daily_ma, round(length(Movie_StarWarsTFW_Daily_ma) * 0.6))
Movie_StarWarsTFW_Daily_ts.h <- length(Movie_StarWarsTFW_Daily_ma) - length(Movie_StarWarsTFW_Daily_ts.train)
Movie_StarWarsTFW_Daily_ts.test <- tail(Movie_StarWarsTFW_Daily_ma, Movie_StarWarsTFW_Daily_ts.h)
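
The same split is repeated for each of the four movies, so it could be wrapped in a small helper. The sketch below is one possible way to do that; split_train_test is a hypothetical name, and it is shown on a plain numeric vector rather than on the actual series.

```r
# Hypothetical helper: split a series into a leading training part and a
# trailing test part, using the same 60/40 rule as the code above.
split_train_test <- function(x, train_frac = 0.6) {
  n_train <- round(length(x) * train_frac)
  h <- length(x) - n_train
  list(
    train = head(x, n_train),  # first n_train observations
    test  = tail(x, h),        # remaining h observations
    h     = h                  # forecast horizon for the test period
  )
}

# Made-up daily sales, for illustration only.
sales <- c(100, 90, 85, 70, 60, 55, 40, 35, 30, 20)
parts <- split_train_test(sales)
parts$h
```

The h element plays the same role as the .h variables above: it is the number of steps to forecast so that the forecasts line up with the test data.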

Model using ARIMA

The auto.arima function is used to select the p, d and q values required by the ARIMA models.
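
To give a feel for what an automatic order search involves, here is a deliberately simplified sketch that fits a small grid of ARIMA(p,d,q) models with the base R arima function and keeps the one with the lowest AIC. It is not the algorithm that auto.arima implements: the differencing order d is fixed by hand, the grid is tiny, and the series is simulated.

```r
# Simulated ARIMA(1,1,0) series, for illustration only.
set.seed(7)
y <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.6), n = 120)

d <- 1                                  # differencing order, fixed by hand here
grid <- expand.grid(p = 0:2, q = 0:2)   # candidate AR and MA orders

# Fit each candidate and record its AIC (NA if the fit fails).
grid$aic <- apply(grid, 1, function(row) {
  fit <- tryCatch(
    arima(y, order = c(row[["p"]], d, row[["q"]])),
    error = function(e) NULL
  )
  if (is.null(fit)) NA_real_ else fit$aic
})

# Candidate with the lowest AIC.
best <- grid[which.min(grid$aic), ]
best
```

A lower AIC indicates a better trade-off between fit and the number of parameters, but only among models with the same differencing order, which is why d is held fixed in this grid.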

Group 1

Movie_Gallow_Daily_ar2fit = auto.arima(Movie_Gallow_Daily_ts.train)
## Warning in value[[3L]](cond): The chosen test encountered an error, so no
## seasonal differencing is selected. Check the time series data.
Movie_Gallow_Daily_ar2fit
## Series: Movie_Gallow_Daily_ts.train 
## ARIMA(1,2,0)(1,0,0)[30] 
## 
## Coefficients:
##           ar1     sar1
##       -0.2988  -0.0048
## s.e.   0.1531   0.1866
## 
## sigma^2 estimated as 9.015e+10:  log likelihood=-560.27
## AIC=1126.55   AICc=1127.21   BIC=1131.61
Movie_GetOut_Daily_ar2fit = auto.arima(Movie_GetOut_Daily_ts.train)
Movie_GetOut_Daily_ar2fit
## Series: Movie_GetOut_Daily_ts.train 
## ARIMA(0,1,3) with drift 
## 
## Coefficients:
##          ma1      ma2      ma3       drift
##       0.2817  -0.8631  -0.2441  -135052.63
## s.e.  0.1254   0.1103   0.1217    36249.87
## 
## sigma^2 estimated as 1.645e+12:  log likelihood=-959.38
## AIC=1928.76   AICc=1929.83   BIC=1939.4

Group 2

Movie_JurassicWorld_Daily_ar2fit = auto.arima(Movie_JurassicWorld_Daily_ts.train)
Movie_JurassicWorld_Daily_ar2fit
## Series: Movie_JurassicWorld_Daily_ts.train 
## ARIMA(2,1,1)(1,0,0)[30] 
## 
## Coefficients:
##           ar1     ar2     ma1     sar1
##       -0.1323  0.1324  0.3965  -0.0143
## s.e.   0.9055  0.2562  0.8988   0.1814
## 
## sigma^2 estimated as 2.677e+13:  log likelihood=-1618.3
## AIC=3246.6   AICc=3247.26   BIC=3259.42
Movie_StarWarsTFW_Daily_ar2fit = auto.arima(Movie_StarWarsTFW_Daily_ts.train)
Movie_StarWarsTFW_Daily_ar2fit
## Series: Movie_StarWarsTFW_Daily_ts.train 
## ARIMA(2,2,2) 
## 
## Coefficients:
##           ar1      ar2     ma1      ma2
##       -0.9564  -0.4030  0.6014  -0.2401
## s.e.   0.3490   0.3308  0.4291   0.4143
## 
## sigma^2 estimated as 6.527e+13:  log likelihood=-1193.82
## AIC=2397.64   AICc=2398.6   BIC=2408.81

Residual Check

Group 1

checkresiduals(Movie_Gallow_Daily_ar2fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,2,0)(1,0,0)[30]
## Q* = 12.895, df = 6.4, p-value = 0.05548
## 
## Model df: 2.   Total lags used: 8.4
checkresiduals(Movie_GetOut_Daily_ar2fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,3) with drift
## Q* = 40.216, df = 8.6, p-value = 4.949e-06
## 
## Model df: 4.   Total lags used: 12.6

Group 2

checkresiduals(Movie_JurassicWorld_Daily_ar2fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1)(1,0,0)[30]
## Q* = 48.03, df = 15.4, p-value = 3.242e-05
## 
## Model df: 4.   Total lags used: 19.4
checkresiduals(Movie_StarWarsTFW_Daily_ar2fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,2,2)
## Q* = 8.5145, df = 10.2, p-value = 0.597
## 
## Model df: 4.   Total lags used: 14.2
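
The Ljung-Box test reported above checks whether the residuals still contain autocorrelation; its null hypothesis is that they do not. Read that way, the small p-values for Get Out and Jurassic World suggest those models leave structure unexplained, the p-value for Star Wars gives no such evidence, and The Gallows is borderline. The sketch below runs the same kind of test with Box.test from base R on a simulated series; the lag of 10 is an arbitrary choice for illustration.

```r
# Simulated AR(1) series and a matching fitted model (illustrative only).
set.seed(3)
y   <- arima.sim(model = list(ar = 0.5), n = 150)
fit <- arima(y, order = c(1, 0, 0))
res <- residuals(fit)

# fitdf = number of estimated ARMA coefficients (one here: ar1),
# so the test uses lag - fitdf degrees of freedom.
lb <- Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 1)

lb$parameter  # degrees of freedom
lb$p.value    # large values give no evidence of leftover autocorrelation
```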

Forecast and accuracy

Group 1

Movie_Gallow_Daily_test.arima = forecast(Movie_Gallow_Daily_ar2fit, h=Movie_Gallow_Daily_ts.h)
accuracy(Movie_Gallow_Daily_test.arima, Movie_Gallow_Daily_ts.test)
##                    ME      RMSE       MAE       MPE      MAPE       MASE
## Training set 42571.20 285591.80 130567.47  10.97382  45.17572 0.08174466
## Test set     20566.85  22810.52  20566.85 546.13335 546.13335 0.01287633
##                     ACF1 Theil's U
## Training set -0.07885675        NA
## Test set      0.75205950  1.701364
Movie_GetOut_Daily_test.arima = forecast(Movie_GetOut_Daily_ar2fit, h=Movie_GetOut_Daily_ts.h)
accuracy(Movie_GetOut_Daily_test.arima, Movie_GetOut_Daily_ts.test)
##                     ME    RMSE       MAE         MPE        MAPE      MASE
## Training set -122168.7 1230739  858567.4    26.21967    65.63541 0.2286964
## Test set     3874618.6 4176499 3874618.6 12382.42863 12382.42863 1.0320810
##                      ACF1 Theil's U
## Training set -0.008972328        NA
## Test set      0.918331756  363.2936

Group 2

Movie_JurassicWorld_Daily_test.arima = forecast(Movie_JurassicWorld_Daily_ar2fit, h=Movie_JurassicWorld_Daily_ts.h)
accuracy(Movie_JurassicWorld_Daily_test.arima, Movie_JurassicWorld_Daily_ts.test)
##                      ME       RMSE        MAE        MPE      MAPE
## Training set -607307.75 5039008.27 2186663.57  -15.79593  52.00676
## Test set      -19697.84   56241.38   44888.68 -234.13459 249.60750
##                     MASE        ACF1 Theil's U
## Training set 0.253070771 -0.01980368        NA
## Test set     0.005195135  0.63274326  4.400156
Movie_StarWarsTFW_Daily_test.arima = forecast(Movie_StarWarsTFW_Daily_ar2fit, h=Movie_StarWarsTFW_Daily_ts.h)
accuracy(Movie_StarWarsTFW_Daily_test.arima, Movie_StarWarsTFW_Daily_ts.test)
##                    ME    RMSE     MAE          MPE       MAPE      MASE
## Training set  1091269 7730125 4218449     7.372596   75.80312 0.2154086
## Test set     -5238892 5983697 5261883 -6224.462819 6226.05312 0.2686899
##                    ACF1 Theil's U
## Training set 0.06135459        NA
## Test set     0.92387409  219.0232
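
To make the error measures in these tables easier to read, the sketch below computes RMSE, MAE and MAPE by hand using their usual definitions, on made-up numbers. It also illustrates one possible reason for very large MAPE values: when the actual values are close to zero, as can happen late in a theatrical run, even a modest absolute error becomes a huge percentage error. Whether that fully explains the test-set MAPE values above is an assumption I have not verified.

```r
# Made-up actual and forecast values; the last two actuals are small on purpose.
actual        <- c(1200, 800, 500, 90, 40)
forecast_vals <- c(1000, 900, 450, 300, 250)

# Forecast error: actual minus forecast.
err <- actual - forecast_vals

rmse <- sqrt(mean(err^2))              # root mean squared error
mae  <- mean(abs(err))                 # mean absolute error
mape <- mean(abs(err / actual)) * 100  # mean absolute percentage error

# Per-observation percentage errors show where MAPE blows up.
round(abs(err / actual) * 100, 1)
c(RMSE = rmse, MAE = mae, MAPE = mape)
```

The last two observations have the same absolute error but contribute far more to MAPE than the first three, because their actual values are small.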

Comparison with other simple forecasting methods

Group 1

Group 2

Stationarity

Differencing

Log Return

Unit Root Test

ACF / PACF

Conclusion

From the analysis above, we can conclude that:

Reflection

References

IMDb 2018A, Daily Box Office Sales - The Gallows, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&id=newlinehorror2015.htm

IMDb 2018B, Movie - Daily Box Office Sales - Get Out, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&id=blumhouse2.htm

IMDb 2018C, Movie - Daily Box Office Sales - Jurassic World, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&view=chart&id=jurassicpark4.htm

IMDb 2018D, Movie - Daily Box Office Sales - Star Wars: The Force Awakens, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&view=chart&id=starwars7.htm

Lee, J 2018, ‘Introduction to Time Series Analysis with R’, viewed 10th October 2018, https://canvas.uts.edu.au/courses/604/files/72884

Hyndman, R & Athanasopoulos, G 2018, ‘Forecasting: Principles and Practice’, viewed 10th October 2018, https://otexts.org/fpp2/

Dalinina, R 2017, ‘Introduction to Forecasting with ARIMA in R’, viewed 20th October 2018, https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials

Tips

Assignment Description

Your report should take the form of a blog post on CIC Around. It should include:

  1. The background that led to your new analysis and its aims.
  2. A justification of the methodologies that you are using to answer your question.
  3. The results that you have obtained and what they imply for your research question.
  4. Conclusions that you can draw from your analysis.
  5. A reflection upon how your new analysis enhances the insights gained in Assessment Task 2.
  6. Please mark your blog post with the STDS-Further-Explorations tag.

Assessment Criteria

  • 40% - Novelty and coherence of the new analysis (including aims, new research questions and argumentation).
  • 30% - Soundness of the statistical methodology that also shows evidence of having applied relevant new analytical methods to address the new research question.
  • 30% - Appropriateness of the interpretation applied to the results, where the resulting conclusions are well justified and answer the stated research question.