In Assignment 2, our group set out to provide actionable insights that StudioX (a new entrant into the Hollywood film industry) could use to quantifiably drive movie sales. We concluded the following insights based on the analysis:
In this paper, I will go into more detail and see how we can forecast box office sales using time series analysis. As StudioX is a new entrant to the Hollywood film industry, it is important that they have the ability to forecast sales accurately in order to minimise risk and make the right decisions.
Using the data from Assignment 2, I will attempt to forecast the daily box office sales of the movies that yield the highest Return On Investment (ROI) and see whether there are any patterns across different market segments. For example, I will forecast the box office sales for a small-budget movie and compare the result with the forecast for blockbuster movies, which aim for a bigger audience. This will give StudioX another toolkit for choosing a forecasting model based on the strategy they would like to take in the future.
Using the data from Assignment 2, we first get the top 5 movies that have the highest Return On Investment (ROI) from 2015 to 2018. We define the ROI as:
(Gross Box Office Sales - Budget) / Budget * 100
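As a quick check of the formula, the sketch below recomputes the ROI for three movies using the gross and budget figures reported in the ROI tables of this post. The function name `roi_pct` and the data frame are illustrative only and are not part of the original analysis.

```r
# ROI as defined above: (gross - budget) / budget * 100
roi_pct <- function(gross, budget) {
  (gross - budget) / budget * 100
}

# Gross and budget figures copied from the ROI tables in this post
movies <- data.frame(
  title  = c("The Gallows", "Get Out", "Jurassic World"),
  gross  = c(25578900, 182376000, 730422400),
  budget = c(100000, 4500000, 150000000)
)

# Round to whole percentages, as in the tables
movies$roi <- round(roi_pct(movies$gross, movies$budget))
print(movies)
```

The rounded values match the ROI (%) column reported for these three movies (25479, 3953 and 387).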
We separate the movies into two groups:
Here are the top 5 movies that yield the highest ROI for Group 1:
| Title | Genres | Total Gross ($) | Total Budget ($) | ROI (%) |
|---|---|---|---|---|
| The Gallows | Horror, Thriller | 25578900 | 100000 | 25479 |
| Get Out | Horror, Mystery, Thriller | 182376000 | 4500000 | 3953 |
| Unfriended | Drama, Horror, Mystery | 34972000 | 1000000 | 3397 |
| God’s Not Dead | Drama | 69703700 | 2000000 | 3385 |
| War Room | Drama | 72208000 | 3000000 | 2307 |
Here are the top 5 movies that yield the highest ROI for Group 2:
| Title | Genres | Total Gross ($) | Total Budget ($) | ROI (%) |
|---|---|---|---|---|
| Jurassic World | Action, Adventure, Sci-Fi | 730422400 | 150000000 | 387 |
| Star Wars: The Force Awakens | Action, Adventure, Fantasy | 1009345800 | 245000000 | 312 |
| Beauty and the Beast (2017) | Fantasy | 523204000 | 160000000 | 227 |
| The Hunger Games: Mockingjay - Part 1 | Action, Adventure, Sci-Fi | 384883000 | 125000000 | 208 |
| Frozen | Animation, Adventure, Comedy | 459757600 | 150000000 | 207 |
Comparing the two groups, Group 1 has a much higher ROI than Group 2.
I will pick the top 2 movies from each group, extract their daily box office sales from the Box Office Mojo website (IMDb 2018A, IMDb 2018B, IMDb 2018C and IMDb 2018D), fit an ARIMA model to each one and compare the results.
Here is the high-level approach I am going to use:
Here are the time series plots of the top 2 ROI movies in each group:
Both groups show a similar downward trend: every movie generates fewer sales the longer it stays in theaters. Hence none of the time series is stationary.
We will use the stl function to decompose the time series data. STL is an acronym for “Seasonal and Trend decomposition using Loess”, where Loess is a method for estimating nonlinear relationships (Hyndman & Athanasopoulos 2018, chapter 6.6, para. 1).
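A minimal sketch of an STL decomposition with base R's `stl` function is shown here. The series is synthetic (a decaying trend plus a weekly cycle), and `frequency = 7` is an assumption made for the illustration; it is not the box office data or necessarily the seasonal period used in the analysis.

```r
# Synthetic daily series (hypothetical, not the box office data):
# a declining trend plus a weekly pattern and some noise
set.seed(42)
n <- 70
day <- seq_len(n)
sales <- 1000 * exp(-day / 30) + 50 * sin(2 * pi * day / 7) + rnorm(n, sd = 10)

# frequency = 7 assumes a weekly cycle in daily data
sales_ts <- ts(sales, frequency = 7)

# s.window = "periodic" treats the seasonal pattern as fixed across cycles
fit <- stl(sales_ts, s.window = "periodic")

# The result holds three components: seasonal, trend and remainder
head(fit$time.series)
plot(fit)
```

The three components add back up to the original series, so the trend panel of the plot gives a direct view of the decline once the repeating pattern has been removed.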
The graphs above indicate that the sales are not stationary, so we need to apply differencing to stabilise the data.
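To illustrate what differencing does, the sketch below applies base R's `diff` to a synthetic trending series (a random walk with negative drift, chosen only for the example). The lag-1 autocorrelation is used here as a rough indicator; it is not a formal stationarity test.

```r
set.seed(1)
# Hypothetical trending series: a random walk with negative drift
x <- ts(cumsum(rnorm(100, mean = -2)))

d1 <- diff(x, differences = 1)  # first difference
d2 <- diff(x, differences = 2)  # second difference

# Each round of differencing shortens the series by one observation
length(x); length(d1); length(d2)

# A lag-1 autocorrelation close to 1 is typical of a trending series;
# after differencing it is usually much smaller
acf(x, plot = FALSE)$acf[2]
acf(d1, plot = FALSE)$acf[2]
```

The number of differences applied corresponds to the d order of an ARIMA(p,d,q) model.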
We use the first 60% of the observations as a training set to fit the models, and hold out the remaining 40% as test data for forecasting.
Movie_Gallow_Daily_ts.train <- head(Movie_Gallow_Daily_ma, round(length(Movie_Gallow_Daily_ma) * 0.6))
Movie_Gallow_Daily_ts.h <- length(Movie_Gallow_Daily_ma) - length(Movie_Gallow_Daily_ts.train)
Movie_Gallow_Daily_ts.test <- tail(Movie_Gallow_Daily_ma, Movie_Gallow_Daily_ts.h)
Movie_GetOut_Daily_ts.train <- head(Movie_GetOut_Daily_ma, round(length(Movie_GetOut_Daily_ma) * 0.6))
Movie_GetOut_Daily_ts.h <- length(Movie_GetOut_Daily_ma) - length(Movie_GetOut_Daily_ts.train)
Movie_GetOut_Daily_ts.test <- tail(Movie_GetOut_Daily_ma, Movie_GetOut_Daily_ts.h)
Movie_JurassicWorld_Daily_ts.train <- head(Movie_JurassicWorld_Daily_ma, round(length(Movie_JurassicWorld_Daily_ma) * 0.6))
Movie_JurassicWorld_Daily_ts.h <- length(Movie_JurassicWorld_Daily_ma) - length(Movie_JurassicWorld_Daily_ts.train)
Movie_JurassicWorld_Daily_ts.test <- tail(Movie_JurassicWorld_Daily_ma, Movie_JurassicWorld_Daily_ts.h)
Movie_StarWarsTFW_Daily_ts.train <- head(Movie_StarWarsTFW_Daily_ma, round(length(Movie_StarWarsTFW_Daily_ma) * 0.6))
Movie_StarWarsTFW_Daily_ts.h <- length(Movie_StarWarsTFW_Daily_ma) - length(Movie_StarWarsTFW_Daily_ts.train)
Movie_StarWarsTFW_Daily_ts.test <- tail(Movie_StarWarsTFW_Daily_ma, Movie_StarWarsTFW_Daily_ts.h)
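The same split is repeated for each movie above, so it could be wrapped in a small helper. The function `split_series` below is a hypothetical refactoring sketch that follows the same head/tail logic; it is not part of the original code, and the demo series is placeholder data.

```r
# Hypothetical helper: chronological train/test split for a time series
split_series <- function(x, train_frac = 0.6) {
  n_train <- round(length(x) * train_frac)
  list(
    train = head(x, n_train),
    test  = tail(x, length(x) - n_train),
    h     = length(x) - n_train   # forecast horizon = size of the test set
  )
}

demo <- ts(1:50)          # placeholder data, not the box office series
parts <- split_series(demo)
length(parts$train)       # 30
parts$h                   # 20
```

Because the split is chronological rather than random, the test set always contains the most recent observations, which is what a forecast evaluation needs.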
The auto.arima function is used to select the p, d and q orders required by the ARIMA models.
Movie_Gallow_Daily_ar2fit = auto.arima(Movie_Gallow_Daily_ts.train)
## Warning in value[[3L]](cond): The chosen test encountered an error, so no
## seasonal differencing is selected. Check the time series data.
Movie_Gallow_Daily_ar2fit
## Series: Movie_Gallow_Daily_ts.train
## ARIMA(1,2,0)(1,0,0)[30]
##
## Coefficients:
## ar1 sar1
## -0.2988 -0.0048
## s.e. 0.1531 0.1866
##
## sigma^2 estimated as 9.015e+10: log likelihood=-560.27
## AIC=1126.55 AICc=1127.21 BIC=1131.61
Movie_GetOut_Daily_ar2fit = auto.arima(Movie_GetOut_Daily_ts.train)
Movie_GetOut_Daily_ar2fit
## Series: Movie_GetOut_Daily_ts.train
## ARIMA(0,1,3) with drift
##
## Coefficients:
## ma1 ma2 ma3 drift
## 0.2817 -0.8631 -0.2441 -135052.63
## s.e. 0.1254 0.1103 0.1217 36249.87
##
## sigma^2 estimated as 1.645e+12: log likelihood=-959.38
## AIC=1928.76 AICc=1929.83 BIC=1939.4
Movie_JurassicWorld_Daily_ar2fit = auto.arima(Movie_JurassicWorld_Daily_ts.train)
Movie_JurassicWorld_Daily_ar2fit
## Series: Movie_JurassicWorld_Daily_ts.train
## ARIMA(2,1,1)(1,0,0)[30]
##
## Coefficients:
## ar1 ar2 ma1 sar1
## -0.1323 0.1324 0.3965 -0.0143
## s.e. 0.9055 0.2562 0.8988 0.1814
##
## sigma^2 estimated as 2.677e+13: log likelihood=-1618.3
## AIC=3246.6 AICc=3247.26 BIC=3259.42
Movie_StarWarsTFW_Daily_ar2fit = auto.arima(Movie_StarWarsTFW_Daily_ts.train)
Movie_StarWarsTFW_Daily_ar2fit
## Series: Movie_StarWarsTFW_Daily_ts.train
## ARIMA(2,2,2)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## -0.9564 -0.4030 0.6014 -0.2401
## s.e. 0.3490 0.3308 0.4291 0.4143
##
## sigma^2 estimated as 6.527e+13: log likelihood=-1193.82
## AIC=2397.64 AICc=2398.6 BIC=2408.81
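To give a feel for what an automated order search does, the sketch below runs a small grid search over p and q with base R's `arima`, keeping d fixed. This is a simplified stand-in and not the algorithm inside auto.arima, which chooses d with unit root tests and then searches over models by AICc. The simulated series, the fixed `d <- 1` and the 0 to 2 search range are all assumptions made for the example.

```r
set.seed(123)
# Hypothetical series: simulated ARIMA(1,1,0) data, not the box office sales
y <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 120)

d <- 1  # differencing order fixed by assumption; AIC is not comparable across d
grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- NA_real_

for (i in seq_len(nrow(grid))) {
  fit <- tryCatch(
    arima(y, order = c(grid$p[i], d, grid$q[i])),
    error = function(e) NULL   # skip orders that fail to fit
  )
  if (!is.null(fit)) grid$aic[i] <- AIC(fit)
}

# The candidate with the lowest AIC is the preferred model in this sketch
best <- grid[which.min(grid$aic), ]
best
```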
checkresiduals(Movie_Gallow_Daily_ar2fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,2,0)(1,0,0)[30]
## Q* = 12.895, df = 6.4, p-value = 0.05548
##
## Model df: 2. Total lags used: 8.4
checkresiduals(Movie_GetOut_Daily_ar2fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,3) with drift
## Q* = 40.216, df = 8.6, p-value = 4.949e-06
##
## Model df: 4. Total lags used: 12.6
checkresiduals(Movie_JurassicWorld_Daily_ar2fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)(1,0,0)[30]
## Q* = 48.03, df = 15.4, p-value = 3.242e-05
##
## Model df: 4. Total lags used: 19.4
checkresiduals(Movie_StarWarsTFW_Daily_ar2fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,2,2)
## Q* = 8.5145, df = 10.2, p-value = 0.597
##
## Model df: 4. Total lags used: 14.2
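The Ljung-Box p-values above can be read side by side. The sketch below collects the four reported p-values and compares them with a 5% significance level, which is a conventional threshold assumed here rather than one stated in the analysis. A p-value above the threshold means the test does not reject the hypothesis that the residuals are uncorrelated; it does not prove that they are white noise. The second part shows the equivalent base R call, `Box.test`, on placeholder residuals.

```r
# p-values copied from the checkresiduals() output in this post
lb <- data.frame(
  movie   = c("The Gallows", "Get Out", "Jurassic World",
              "Star Wars: The Force Awakens"),
  p_value = c(0.05548, 4.949e-06, 3.242e-05, 0.597)
)

alpha <- 0.05  # conventional significance level (an assumption)
lb$no_autocorrelation_detected <- lb$p_value > alpha
lb

# The same test in base R; fitdf is the number of fitted ARMA coefficients
set.seed(7)
res <- rnorm(80)  # placeholder residuals, not from the fitted models
Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 2)
```

On this reading, the residuals of the Get Out and Jurassic World models still show significant autocorrelation, which suggests those two models leave some structure in the data unexplained.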
Movie_Gallow_Daily_test.arima = forecast(Movie_Gallow_Daily_ar2fit, h=Movie_Gallow_Daily_ts.h)
accuracy(Movie_Gallow_Daily_test.arima, Movie_Gallow_Daily_ts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 42571.20 285591.80 130567.47 10.97382 45.17572 0.08174466
## Test set 20566.85 22810.52 20566.85 546.13335 546.13335 0.01287633
## ACF1 Theil's U
## Training set -0.07885675 NA
## Test set 0.75205950 1.701364
Movie_GetOut_Daily_test.arima = forecast(Movie_GetOut_Daily_ar2fit, h=Movie_GetOut_Daily_ts.h)
accuracy(Movie_GetOut_Daily_test.arima, Movie_GetOut_Daily_ts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -122168.7 1230739 858567.4 26.21967 65.63541 0.2286964
## Test set 3874618.6 4176499 3874618.6 12382.42863 12382.42863 1.0320810
## ACF1 Theil's U
## Training set -0.008972328 NA
## Test set 0.918331756 363.2936
Movie_JurassicWorld_Daily_test.arima = forecast(Movie_JurassicWorld_Daily_ar2fit, h=Movie_JurassicWorld_Daily_ts.h)
accuracy(Movie_JurassicWorld_Daily_test.arima, Movie_JurassicWorld_Daily_ts.test)
## ME RMSE MAE MPE MAPE
## Training set -607307.75 5039008.27 2186663.57 -15.79593 52.00676
## Test set -19697.84 56241.38 44888.68 -234.13459 249.60750
## MASE ACF1 Theil's U
## Training set 0.253070771 -0.01980368 NA
## Test set 0.005195135 0.63274326 4.400156
Movie_StarWarsTFW_Daily_test.arima = forecast(Movie_StarWarsTFW_Daily_ar2fit, h=Movie_StarWarsTFW_Daily_ts.h)
accuracy(Movie_StarWarsTFW_Daily_test.arima, Movie_StarWarsTFW_Daily_ts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 1091269 7730125 4218449 7.372596 75.80312 0.2154086
## Test set -5238892 5983697 5261883 -6224.462819 6226.05312 0.2686899
## ACF1 Theil's U
## Training set 0.06135459 NA
## Test set 0.92387409 219.0232
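For reference, the sketch below computes RMSE, MAE and MAPE by hand on made-up numbers, using the error convention of actual minus predicted. The values are hypothetical and chosen so that the actual series decays towards zero, as daily sales do late in a theatrical run. It illustrates one plausible reason for the very large test-set MAPE values above: percentage errors are divided by the actual value, so they grow quickly when actual sales are small.

```r
# Hypothetical actuals and predictions (not the box office data)
actual    <- c(120, 80, 40, 10, 2)
predicted <- c(100, 90, 50, 30, 25)

err  <- actual - predicted
rmse <- sqrt(mean(err^2))              # penalises large errors more heavily
mae  <- mean(abs(err))                 # average absolute error, in sales units
mape <- mean(abs(err / actual)) * 100  # average absolute error, in percent

c(RMSE = rmse, MAE = mae, MAPE = mape)
```

Here the absolute errors are all of similar size, yet the last two observations dominate the MAPE because their actual values are close to zero. Scale-dependent measures such as RMSE and MAE, or the scaled MASE, may therefore be easier to compare in this setting.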
From the analysis above, we can conclude that:
IMDb 2018A, Daily Box Office Sales - The Gallows, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&id=newlinehorror2015.htm
IMDb 2018B, Movie - Daily Box Office Sales - Get Out, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&id=blumhouse2.htm
IMDb 2018C, Movie - Daily Box Office Sales - Jurassic World, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&view=chart&id=jurassicpark4.htm
IMDb 2018D, Movie - Daily Box Office Sales - Star Wars: The Force Awakens, Box Office Mojo, viewed 15th October 2018, https://www.boxofficemojo.com/movies/?page=daily&view=chart&id=starwars7.htm
Lee, J 2018, ‘Introduction to Time Series Analysis with R’, viewed 10th October 2018, https://canvas.uts.edu.au/courses/604/files/72884
Hyndman, R & Athanasopoulos, G 2018, ‘Forecasting: Principles and Practice’, viewed 10th October 2018, https://otexts.org/fpp2/
Dalinina, R 2017, ‘Introduction to Forecasting with ARIMA in R’, viewed 20th October 2018, https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials