Introduction

In Assignment 2, our group set out to provide actionable insights that StudioX (a new entrant into the Hollywood film industry) could use to quantifiably drive movie sales. Based on that analysis, we reached the following conclusions:

A 10% increase in budget is associated with a 1% increase in sales.
The target spend for a film could be between $4 million and $6 million when targeting a $10 million budget.

In this paper, I go into more detail and examine how we can forecast box office sales using time series analysis. As StudioX is a new entrant to the Hollywood film industry, it is important that it can forecast sales accurately in order to minimise risk and make sound decisions.

Using the data from Assignment 2, I will attempt to forecast the daily box office sales of the movies that yield the highest Return On Investment (ROI) and see whether there are any patterns across different market segments. For example, I will forecast the box office sales for a small-budget movie and compare the result with the forecast for blockbuster movies, which aim for a bigger audience. This will give StudioX another tool for choosing a forecasting model based on the strategy it would like to take in the future.

Datasets

Using the data from Assignment 2, we first identify the top 5 movies with the highest Return On Investment (ROI) from 2015 to 2018. We define ROI as:

(Gross Box Office Sales - Budget) / Budget * 100
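As a quick check of this definition, the sketch below recomputes ROI in R for two rows taken from the Group 1 table in this section. The data frame and its column names (`movies`, `gross`, `budget`) are illustrative assumptions, not the objects used in the original analysis.

```r
# Values copied from the Group 1 table; object and column names are assumptions
movies <- data.frame(
  title  = c("The Gallows", "Get Out"),
  gross  = c(25578900, 182376000),
  budget = c(100000, 4500000)
)

# ROI (%) = (gross - budget) / budget * 100
movies$roi <- (movies$gross - movies$budget) / movies$budget * 100

# Rounded to whole percentages, as reported in the tables
print(round(movies$roi))
```

The rounded values match the ROI column reported for these two movies (25479 and 3953).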

We separate the movies into two groups:

Group 1 - Movies with a budget of less than $100m
Group 2 - Movies with a budget of $100m or more (blockbusters)

Here are the top 5 movies that yield the highest ROI for Group 1:

Title	Genres	Total Gross ($)	Total Budget ($)	ROI (%)
The Gallows	Horror, Thriller	25578900	100000	25479
Get Out	Horror, Mystery, Thriller	182376000	4500000	3953
Unfriended	Drama, Horror, Mystery	34972000	1000000	3397
God’s Not Dead	Drama	69703700	2000000	3385
War Room	Drama	72208000	3000000	2307

Here are the top 5 movies that yield the highest ROI for Group 2:

Title	Genres	Total Gross ($)	Total Budget ($)	ROI (%)
Jurassic World	Action, Adventure, Sci-Fi	730422400	150000000	387
Star Wars: The Force Awakens	Action, Adventure, Fantasy	1009345800	245000000	312
Beauty and the Beast (2017)	Fantasy	523204000	160000000	227
The Hunger Games: Mockingjay - Part 1	Action, Adventure, Sci-Fi	384883000	125000000	208
Frozen	Animation, Adventure, Comedy	459757600	150000000	207

Comparing the two groups, Group 1 has a much higher ROI than Group 2.

I will pick the top 2 movies from each group, extract their daily box office sales data from the Box Office Mojo website (IMDb 2018A, IMDb 2018B, IMDb 2018C and IMDb 2018D), fit an ARIMA model to each one and compare the results.

Methodology

Here is the high-level approach I am going to use:

Data exploration - plot and examine the time series
Decompose the time series data
Check for stationarity and apply differencing if needed
Examine the autocorrelation function (ACF) and partial autocorrelation function (PACF)
Modelling using ARIMA
Perform forecasting
Evaluate the modelling accuracy
Discuss the implications of the results for the research question

Data Exploration

Here are the time series plots of the top 2 ROI movies in each group:

Group 1 - Low Budget

Group 2 - Big Budget

Both groups show a similar downward trend, as all movies generate lower sales the longer they have been showing in theatres; hence none of the time series is stationary.

Decompose the data

We will use the stl function to decompose the time series data. STL is an acronym for “Seasonal and Trend decomposition using Loess”, where Loess is a method for estimating nonlinear relationships (Hyndman & Athanasopoulos 2018, chapter 6.6, para. 1).
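A minimal sketch of an STL decomposition in base R is given here for illustration. The series is synthetic (a decaying trend with a weekend lift), and the weekly period `frequency = 7` is an assumption made for this sketch; it is not necessarily the setting used for the movie series in this report.

```r
# Synthetic daily sales: exponential decay plus a weekend lift (assumed data, not real box office figures)
set.seed(1)
days  <- 1:70
sales <- 1e6 * exp(-0.05 * days) * (1 + 0.5 * (days %% 7 %in% c(5, 6, 0))) +
  rnorm(70, sd = 1e4)

# frequency = 7 treats one week as the seasonal period (an assumption for this sketch)
sales_ts <- ts(sales, frequency = 7)

# s.window = "periodic" forces the seasonal pattern to be identical in every week
fit <- stl(sales_ts, s.window = "periodic")

# The result holds three components per observation: seasonal, trend, remainder
print(head(fit$time.series))
```

Calling `plot(fit)` draws the data together with the seasonal, trend and remainder panels. The three components add back up to the original series.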

Group 1 - Low Budget

Group 2 - Big Budget

The graphs above indicate that the trend dominates the variation in each series, and they confirm the decreasing trend in box office sales. Note that the graphs also indicate that two of the movies (Get Out and Star Wars) show variation related to seasonality. This is probably related to the increase in sales over the weekend.

Stationarity and Differencing

The graphs above indicate that the sales are not stationary, and hence we need to apply differencing to stabilise the data.
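To make the two transformations concrete, the following sketch applies differencing and log returns to a short, made-up declining series using base R only. The numbers are assumptions chosen for readability, not box office data.

```r
# Made-up declining daily sales (assumed values for illustration only)
sales <- c(1000, 800, 650, 500, 420, 330, 270, 210)

# First difference: y_t - y_{t-1}
d1 <- diff(sales)

# Second difference: the difference of the first differences (d = 2 in ARIMA terms)
d2 <- diff(sales, differences = 2)

# Log return: log(y_t) - log(y_{t-1}), roughly the day-on-day proportional change
log_ret <- diff(log(sales))

print(d1)
print(d2)
print(round(log_ret, 3))
```

Each round of differencing shortens the series by one observation. A unit root test, such as `adf.test` from the `tseries` package, could then be applied to the transformed series to check whether it looks stationary.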

Differencing

Log Return

Unit Root Test

Model and evaluation

Split up the training / test data

We use the first 60% of the observations as a training set to fit the models and hold out the remaining 40% as test data for forecasting.

Movie_Gallow_Daily_ts.train <- head(Movie_Gallow_Daily_ma, round(length(Movie_Gallow_Daily_ma) * 0.6))
Movie_Gallow_Daily_ts.h <- length(Movie_Gallow_Daily_ma) - length(Movie_Gallow_Daily_ts.train)
Movie_Gallow_Daily_ts.test <- tail(Movie_Gallow_Daily_ma, Movie_Gallow_Daily_ts.h)

Movie_GetOut_Daily_ts.train <- head(Movie_GetOut_Daily_ma, round(length(Movie_GetOut_Daily_ma) * 0.6))
Movie_GetOut_Daily_ts.h <- length(Movie_GetOut_Daily_ma) - length(Movie_GetOut_Daily_ts.train)
Movie_GetOut_Daily_ts.test <- tail(Movie_GetOut_Daily_ma, Movie_GetOut_Daily_ts.h)

Movie_JurassicWorld_Daily_ts.train <- head(Movie_JurassicWorld_Daily_ma, round(length(Movie_JurassicWorld_Daily_ma) * 0.6))
Movie_JurassicWorld_Daily_ts.h <- length(Movie_JurassicWorld_Daily_ma) - length(Movie_JurassicWorld_Daily_ts.train)
Movie_JurassicWorld_Daily_ts.test <- tail(Movie_JurassicWorld_Daily_ma, Movie_JurassicWorld_Daily_ts.h)

Movie_StarWarsTFW_Daily_ts.train <- head(Movie_StarWarsTFW_Daily_ma, round(length(Movie_StarWarsTFW_Daily_ma) * 0.6))
Movie_StarWarsTFW_Daily_ts.h <- length(Movie_StarWarsTFW_Daily_ma) - length(Movie_StarWarsTFW_Daily_ts.train)
Movie_StarWarsTFW_Daily_ts.test <- tail(Movie_StarWarsTFW_Daily_ma, Movie_StarWarsTFW_Daily_ts.h)

Model using ARIMA

The auto.arima function will be used to find the p, d and q orders required by the ARIMA models.
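To give a feel for what an automated order search does, here is a simplified stand-in written with base R's `arima` function. It is not the algorithm inside `auto.arima`, which by default compares AICc and also chooses the differencing order d; in this sketch d is fixed at 1 and the series is simulated, both of which are assumptions.

```r
set.seed(42)
# Simulated ARIMA(1,1,0) series (assumed data, standing in for a training set)
y <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.6), n = 200)

# Candidate (p, q) orders with d fixed at 1; AIC is only comparable for the same d
grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit <- tryCatch(
    arima(y, order = c(grid$p[i], 1, grid$q[i])),
    error = function(e) NULL  # skip candidates that fail to fit
  )
  if (!is.null(fit)) grid$aic[i] <- AIC(fit)
}

# The candidate with the lowest AIC is preferred
best <- grid[which.min(grid$aic), ]
print(best)
```

The information criterion rewards fit but penalises extra coefficients, which is why a larger model is not chosen automatically.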

Group 1

Movie_Gallow_Daily_ar2fit = auto.arima(Movie_Gallow_Daily_ts.train)

## Warning in value[[3L]](cond): The chosen test encountered an error, so no
## seasonal differencing is selected. Check the time series data.

Movie_Gallow_Daily_ar2fit

## Series: Movie_Gallow_Daily_ts.train 
## ARIMA(1,2,0)(1,0,0)[30] 
## 
## Coefficients:
##           ar1     sar1
##       -0.2988  -0.0048
## s.e.   0.1531   0.1866
## 
## sigma^2 estimated as 9.015e+10:  log likelihood=-560.27
## AIC=1126.55   AICc=1127.21   BIC=1131.61

Movie_GetOut_Daily_ar2fit = auto.arima(Movie_GetOut_Daily_ts.train)
Movie_GetOut_Daily_ar2fit

## Series: Movie_GetOut_Daily_ts.train 
## ARIMA(0,1,3) with drift 
## 
## Coefficients:
##          ma1      ma2      ma3       drift
##       0.2817  -0.8631  -0.2441  -135052.63
## s.e.  0.1254   0.1103   0.1217    36249.87
## 
## sigma^2 estimated as 1.645e+12:  log likelihood=-959.38
## AIC=1928.76   AICc=1929.83   BIC=1939.4

Group 2

Movie_JurassicWorld_Daily_ar2fit = auto.arima(Movie_JurassicWorld_Daily_ts.train)
Movie_JurassicWorld_Daily_ar2fit

## Series: Movie_JurassicWorld_Daily_ts.train 
## ARIMA(2,1,1)(1,0,0)[30] 
## 
## Coefficients:
##           ar1     ar2     ma1     sar1
##       -0.1323  0.1324  0.3965  -0.0143
## s.e.   0.9055  0.2562  0.8988   0.1814
## 
## sigma^2 estimated as 2.677e+13:  log likelihood=-1618.3
## AIC=3246.6   AICc=3247.26   BIC=3259.42

Movie_StarWarsTFW_Daily_ar2fit = auto.arima(Movie_StarWarsTFW_Daily_ts.train)
Movie_StarWarsTFW_Daily_ar2fit

## Series: Movie_StarWarsTFW_Daily_ts.train 
## ARIMA(2,2,2) 
## 
## Coefficients:
##           ar1      ar2     ma1      ma2
##       -0.9564  -0.4030  0.6014  -0.2401
## s.e.   0.3490   0.3308  0.4291   0.4143
## 
## sigma^2 estimated as 6.527e+13:  log likelihood=-1193.82
## AIC=2397.64   AICc=2398.6   BIC=2408.81

Residual Check

Group 1

checkresiduals(Movie_Gallow_Daily_ar2fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,2,0)(1,0,0)[30]
## Q* = 12.895, df = 6.4, p-value = 0.05548
## 
## Model df: 2.   Total lags used: 8.4

checkresiduals(Movie_GetOut_Daily_ar2fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,3) with drift
## Q* = 40.216, df = 8.6, p-value = 4.949e-06
## 
## Model df: 4.   Total lags used: 12.6

Group 2

checkresiduals(Movie_JurassicWorld_Daily_ar2fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1)(1,0,0)[30]
## Q* = 48.03, df = 15.4, p-value = 3.242e-05
## 
## Model df: 4.   Total lags used: 19.4

checkresiduals(Movie_StarWarsTFW_Daily_ar2fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,2,2)
## Q* = 8.5145, df = 10.2, p-value = 0.597
## 
## Model df: 4.   Total lags used: 14.2
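In the outputs above, a Ljung-Box p-value below 0.05 (Get Out and Jurassic World) suggests that autocorrelation remains in the residuals, while the values for The Gallows (0.055) and Star Wars (0.597) do not reject the white-noise hypothesis at the 5% level. The sketch below shows the same test with base R's `Box.test` on two simulated series; the data, the lag of 10 and `fitdf = 0` are assumptions for illustration.

```r
set.seed(7)
# White noise (assumed data): what residuals from a well-specified model should resemble
res_good <- rnorm(100)

# Strongly autocorrelated series standing in for residuals from a poorly specified model
res_bad <- as.numeric(arima.sim(model = list(ar = 0.8), n = 100))

# Ljung-Box test; H0: no autocorrelation up to the chosen lag.
# For real model residuals, fitdf would be the number of estimated ARMA coefficients.
lb_good <- Box.test(res_good, lag = 10, type = "Ljung-Box", fitdf = 0)
lb_bad  <- Box.test(res_bad,  lag = 10, type = "Ljung-Box", fitdf = 0)

print(lb_good$p.value)
print(lb_bad$p.value)
```

A small p-value for the autocorrelated series is the pattern to watch for: it signals structure that the model has not captured.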

Forecast and accuracy

Group 1

Movie_Gallow_Daily_test.arima = forecast(Movie_Gallow_Daily_ar2fit, h=Movie_Gallow_Daily_ts.h)
accuracy(Movie_Gallow_Daily_test.arima, Movie_Gallow_Daily_ts.test)

##                    ME      RMSE       MAE       MPE      MAPE       MASE
## Training set 42571.20 285591.80 130567.47  10.97382  45.17572 0.08174466
## Test set     20566.85  22810.52  20566.85 546.13335 546.13335 0.01287633
##                     ACF1 Theil's U
## Training set -0.07885675        NA
## Test set      0.75205950  1.701364

Movie_GetOut_Daily_test.arima = forecast(Movie_GetOut_Daily_ar2fit, h=Movie_GetOut_Daily_ts.h)
accuracy(Movie_GetOut_Daily_test.arima, Movie_GetOut_Daily_ts.test)

##                     ME    RMSE       MAE         MPE        MAPE      MASE
## Training set -122168.7 1230739  858567.4    26.21967    65.63541 0.2286964
## Test set     3874618.6 4176499 3874618.6 12382.42863 12382.42863 1.0320810
##                      ACF1 Theil's U
## Training set -0.008972328        NA
## Test set      0.918331756  363.2936

Group 2

Movie_JurassicWorld_Daily_test.arima = forecast(Movie_JurassicWorld_Daily_ar2fit, h=Movie_JurassicWorld_Daily_ts.h)
accuracy(Movie_JurassicWorld_Daily_test.arima, Movie_JurassicWorld_Daily_ts.test)

##                      ME       RMSE        MAE        MPE      MAPE
## Training set -607307.75 5039008.27 2186663.57  -15.79593  52.00676
## Test set      -19697.84   56241.38   44888.68 -234.13459 249.60750
##                     MASE        ACF1 Theil's U
## Training set 0.253070771 -0.01980368        NA
## Test set     0.005195135  0.63274326  4.400156

Movie_StarWarsTFW_Daily_test.arima = forecast(Movie_StarWarsTFW_Daily_ar2fit, h=Movie_StarWarsTFW_Daily_ts.h)
accuracy(Movie_StarWarsTFW_Daily_test.arima, Movie_StarWarsTFW_Daily_ts.test)

##                    ME    RMSE     MAE          MPE       MAPE      MASE
## Training set  1091269 7730125 4218449     7.372596   75.80312 0.2154086
## Test set     -5238892 5983697 5261883 -6224.462819 6226.05312 0.2686899
##                    ACF1 Theil's U
## Training set 0.06135459        NA
## Test set     0.92387409  219.0232
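The test-set MAPE values above are very large (for example 12382% for Get Out) even where RMSE and MAE are moderate. One plausible reason is that daily sales late in a movie's run are small, so dividing the error by the actual value inflates the percentage. The sketch below computes the three measures by hand on made-up numbers to show the effect; the vectors `actual` and `predicted` are assumptions, not values from the models above.

```r
# Made-up actual and predicted values; the later actuals are small, as late-run daily sales tend to be
actual    <- c(5000, 2000, 500, 50)
predicted <- c(5500, 2600, 1100, 650)

# Error defined as actual minus forecast
err <- actual - predicted

rmse <- sqrt(mean(err^2))              # root mean squared error, in the units of the data
mae  <- mean(abs(err))                 # mean absolute error, in the units of the data
mape <- mean(abs(err / actual)) * 100  # mean absolute percentage error

print(c(RMSE = rmse, MAE = mae, MAPE = mape))
```

Here the absolute errors are similar for every observation (MAE is 575), yet MAPE is 340% because the last observation alone contributes a 1200% error. Scale-dependent measures or MASE may therefore be more informative for series that decay towards zero.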

Compare with other simple forecasting methods

Group 1

Group 2

Stationarity

Differencing

Log Return

Unit Root Test

ACF / PACF

Conclusion

From the analysis above, we can conclude that:

Reflection

In the future, we could focus on predicting sales for movies of different genres.
We could also include data from the secondary sales market.

Reference

IMDb 2018A, Daily Box Office Sales - The Gallows, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&id=newlinehorror2015.htm

IMDb 2018B, Movie - Daily Box Office Sales - Get Out, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&id=blumhouse2.htm

IMDb 2018C, Movie - Daily Box Office Sales - Jurassic World, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&view=chart&id=jurassicpark4.htm

IMDb 2018D, Movie - Daily Box Office Sales - Star Wars: The Force Awakens, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&view=chart&id=starwars7.htm

Lee, J. 2018. ‘Introduction to Time Series Analysis with R’, viewed on 10th October 2018. https://canvas.uts.edu.au/courses/604/files/72884

Hyndman, R and Athanasopoulos, G. 2018. ‘Forecasting: Principles and Practice’, viewed on 10th October 2018. https://otexts.org/fpp2/

Dalinina, R 2017. ‘Introduction to Forecasting with ARIMA in R’, viewed on 20th October 2018. https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials

Tips

How do I find the anomalies for the most profitable movie?
There are different ways to get into the break even quadrant. Cheap with small audience, expensive but bigger audience. Risk especially starts to come in here… how will you start to address this?

Assignment Description

Your report should take the form of a blog post on CIC Around. It should include:
1. The background that led to your new analysis and its aims.
2. A justification of the methodologies that you are using to answer your question.
3. The results that you have obtained and what they imply for your research question.
4. Conclusions that you can draw from your analysis.
5. A reflection upon how your new analysis enhances the insights gained in Assessment Task 2.
6. Please mark your blog post with the STDS-Further-Explorations tag.

Assessment Criteria

40% - Novelty and coherence of the new analysis (including aims, new research questions and argumentation).
30% - Soundness of the statistical methodology that also shows evidence of having applied relevant new analytical methods to address the new research question.
30% - Appropriateness of the interpretation applied to the results, where the resulting conclusions are well justified and answer the stated research question.

Time Series Analysis - Forecast Box Office Sales

Benny Lee

13/10/2018
