Abstract Inspired by the challenging in Kaggle’s competitions, as well as my expectation to practice as much as possible data science techniques I have learned, in this project I used the dataset from Kaggle TMDb Box Office Competition and publicly additional data to develop a machine learning model to predict movie’s revenue.

This documents consist 4 sections:

Introduction

Background

In general, the film-industry is having a long history of development and in a constant growth trend now. Starting in 1929, during the Great Depression and the Golden Age of Hollywood, an insight began to evolve related to the consumption of movie tickets. It appeared that even in that bad economic period, the film industry kept growing. The phenomenon repeated in the 2008 recession. The global box office was worth 41.7 billion in 2018, that would rep a 2.7% upwards shift from 2017 $40.6B and mark only the second time ever that it’s cracked $40B. Revenue from North America and Canada came in at a best-ever $11.9B, a mammoth 7% spike over 2017, while attendance was up 5%.

From business point of view, one of the main interests of the film studios and its related stakeholders is a prediction of revenue that a new movie can generate based on a few given input attributes before its released date.

TMDB Box Office Competition

“Can you predict a movie’s worldwide box office revenue?”

On February \(9^{th}, 2019\) Kaggle launch a competition, using the TMDB Box Office dataset to predict a movie’s worldwide box office revenue. This competition was end by May \(31^{th}\), 2019 which total 1398 teams, 1618 competitors and 19,034 entries. Here come the introduction from Kaggle:

“In a world… where movies made an estimated $41.7 billion in 2018, the film industry is more popular than ever. But what movies make the most money at the box office? How much does a director matter? Or the budget? For some movies, it’s”You had me at ‘Hello.’" For others, the trailer falls short of expectations and you think “What we have here is a failure to communicate.”

In this competition, you’re presented with metadata on over 7,000 past films from The Movie Database to try and predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. You can collect other publicly available data to use in your model predictions, but in the spirit of this competition, use only data that would have been available before a movie’s release." [1]

Competition evaluation

To evaluate results in Kaggle’s competition, competitors must develop machine learning model to predict the international box office revenue for each movie. For each id in the test set, competitors must predict the value of the revenue variable.

Submissions are evaluated on Root-Mean-Squared-Logarithmic-Error (RMSLE) between the predicted value and the actual revenue. Logs are taken to not overweight blockbuster revenue movies. [1]

With:

  • \(y_{i}\) the true revenue of movie i

  • \(\hat{y}_{i}\) the predicted revenue of movie i

Since this competition completed on May \(31^{st}\), following is the final public leaderboard.

Noted: even Kaggle’s competition rules mentioned “use only data that would have been available before a movie’s release”, many competitors trained there model with the data available after the movie’s released date. [2]

Project objective

The primary goal is to build a machine-learning model to predict the revenue of a new movie given such features as budget, release dates, genres, production companies, production countries… The modeling performance is evaluating based on the RMSLE, which is same with the Kaggle TMDb Box Office Competition. There is no targeted RMSLE required to achieve in this project.

The secondary goal is to practice skills data wrangling, data visualization, reporting using stringr, dplyr, ggplot2, rmarkdown, kableExtra.

Project methodology

This project has 4 high-level steps:

  • Step 1: data overview and pre-processing overviewing and cleaning data using additional data from The Movie Database and Wikipedia.

  • Step 2: data exploratory analysis and features engineering explore and visualize the data to have an overview with-in and between the variables, what’s insights gained and what’s new features added in. Main package for this step is tidyverse, to handle the cleaning, exploring and visualizing tasks. Output of this step is a set of variables for modeling experiment and training.

  • Step 3: modeling experiments design and conduct a set of experiments to evaluate performance and select machine learning method, compare and select features selection approach, evaluate modeling performance before and after apply log-transformation.

  • Step 4: final evaluate the model on the validation set using RMSLE.

Data overview

Kaggle competition’s dataset

The dataset used in this competition has been collected from the Movies Database [7]. The movie details, credits and keywords have been collected from the TMDB Open API. This competition uses the TMDB API but is not endorsed or certified by TMDB. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows.

This data contain 7398 movies and a variety of metadata. Movies are labeled with id. Data points include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. The dataset was subset to 2 dataset, the train.csv has 3000 movies and test.csv with 4398 movies.

All datas which were used in this project, are able to download from my github repo.

Import data

Data overview

The train_set have 3000 observations of 23 variables.

## [1] 3000   23

The test_set has 4398 observations of 22 variables.

## [1] 4398   22

Combine train and test dataset to one data to save time for any data pre-processing and transforming later on.

Lets take a glimpse at Kaggle’s dataset to get a feel of how it looks like.

## Observations: 7,398
## Variables: 23
## $ ï..id                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ belongs_to_collection <chr> "[{'id': 313576, 'name': 'Hot Tub Time M...
## $ budget                <int> 14000000, 40000000, 3300000, 1200000, NA...
## $ genres                <chr> "[{'id': 35, 'name': 'Comedy'}]", "[{'id...
## $ homepage              <chr> NA, NA, "http://sonyclassics.com/whiplas...
## $ imdb_id               <chr> "tt2637294", "tt0368933", "tt2582802", "...
## $ original_language     <chr> "en", "en", "en", "hi", "ko", "en", "en"...
## $ original_title        <chr> "Hot Tub Time Machine 2", "The Princess ...
## $ overview              <chr> "When Lou, who has become the \"father o...
## $ popularity            <dbl> 6.575393, 8.248895, 64.299990, 3.174936,...
## $ poster_path           <chr> "/tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg", "/w9...
## $ production_companies  <chr> "[{'name': 'Paramount Pictures', 'id': 4...
## $ production_countries  <chr> "[{'iso_3166_1': 'US', 'name': 'United S...
## $ release_date          <chr> "2/20/15", "8/6/04", "10/10/14", "3/9/12...
## $ runtime               <int> 93, 113, 105, 122, 118, 83, 92, 84, 100,...
## $ spoken_languages      <chr> "[{'iso_639_1': 'en', 'name': 'English'}...
## $ status                <chr> "Released", "Released", "Released", "Rel...
## $ tagline               <chr> "The Laws of Space and Time are About to...
## $ title                 <chr> "Hot Tub Time Machine 2", "The Princess ...
## $ Keywords              <chr> "[{'id': 4379, 'name': 'time travel'}, {...
## $ cast                  <chr> "[{'cast_id': 4, 'character': 'Lou', 'cr...
## $ crew                  <chr> "[{'credit_id': '59ac067c92514107af02c8c...
## $ revenue               <int> 12314651, 95149435, 13092000, 16000000, ...

Those variables are classifying to two main type of data, character/text and numeric/integer.

  • The character/text variables are belong to collection, genres, crew, cast… It is worth noting that the text structure of those character variables look like complicated. For example in the first value of genres “[{‘id’: 35, ‘name’: ‘\(\color{red}{\text{Comedy}}\)’}]”, , only “Comedy” is informative. This first problem leads to a required pre-processing step to remove all unnecessary character and/or extract only the required data for analysis.

  • The numeric variables are budget, revenue, popularity, runtime.

Following is the summary table on variables:

Variables Type Description
budget integer Budget of a movie in American Dollar (USD) and not be adjusted
for inflation.
popularity numeric Popularity was based on user interactions on the TMDb website
runtime integer Duration in minutes of a movie
revenue integer Revenue of a movie in American Dollar (USD) and not be adjusted
for inflation. Recommended resources for box office revenue
information: Box Office Mojo and The Numbers.
release_date date / time Release date of a movie
original_language character The original language of a movie
original_title character The original title is usually the title of the original version
of the film when it is first officially released locally.
overview character Overviews should describe the plot of the movie. They should be
to the point, spoiler-free and brief. A few lines at most.
poster_path character link to movie’s póter
production_companies character list of production companies
production_countries character list of production countries
spoken_languages character Only the languages spoken in the original version. No
translated/dubbed languages.
tagline character A movie tagline is usually a short promotional text used on
the poster.
Keywords character to describe the plot of movie. Around 5-10 keywords for TV
zshows and 15-20 keywords maximum for movies is reasonable.
cast character Only the cast of the original (not dubbed or extended) version.
crew character Only the crew credited in the original version.

More description on variables could be reference from following link

https://www.themoviedb.org/bible/movie/

NA value status

Missing values must be dropped or replaced in order to draw correct conclusion from the data. The second problem of this dataset is missing value in many variables.

Two highest NA ratio variables arebelongs_to_collection and homepage, which are more than 60%. budget, which suppose the most important variables, also have 27.07% NA values. Other tagline, Keywords, production_companies have lower NA value ratio, less than 20%.

Since there are many available publicly dataset in internet on budget number, movies description… as well as the competition rules allow to use any publicly dataset which are available before the movie’s release data. I decide to use the additional data from The Movie Database itself and Wikipedia website to fulfill all missing values.

Additional dataset

The third problem is that there were many concern on the quality of dataset provided in this Kaggle competition, especially the accuracy of budget, revenue, popularity… [2]

In addition, the moment I have been working in this project later on the Competition timeline. Therefore, this might lead to unforseen circumstance due to inaccuracy data. Hence, to assure the accuracy and consistency of the dataset, in this project I set priority to use the additional TMDb data at first, then the second priority is Kaggle’s dataset, the Wikipedia is to use ultimately. The publicly additional data are from following sources:

The Movies Database API

The additional data from TMDB website is openned and able to download by following their instruction.

The instruction to retrieve data from TMDb website is reference from following link (http://www.planetanalytics.in/2017/05/how-to-extract-movie-data-from-movie.html).

The TMDb packages was also available in R libraries [10] which you can retrieve other features for this competition.

Import data

Due to very long processing time to download data from TMDb website, I did not include the code in this report, but you can download data direct from my github repo.

Data overview

This data set have 7398 observations of 14 variables.

## Observations: 7,398
## Variables: 14
## $ imdb_id          <fct> tt1380152, tt0391024, tt0117110, tt0093857, t...
## $ new_budget       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ vote_count       <int> 10, 30, 170, 117, 114, 626, 15, 331, 55, 450,...
## $ vote_average     <dbl> 5.6, 7.4, 6.4, 5.4, 6.0, 5.9, 5.0, 5.6, 5.5, ...
## $ new_popularity   <dbl> 4.077, 2.293, 8.695, 6.974, 6.378, 10.905, 1....
## $ new_revenue      <dbl> 3923970, 2586511, 34327391, 22642033, 1234254...
## $ new_runtime      <int> 118, 84, 100, 98, 111, 116, 92, 87, 95, 117, ...
## $ new_release_date <fct> 2/5/2009, 1/15/2004, 2/16/1996, 7/10/1987, 12...
## $ genres1          <fct> Adventure, Adventure, Drama, Action, Horror, ...
## $ genres2          <fct> Family, Comedy, Comedy, Crime, Thriller, Anim...
## $ genres3          <fct> Horror, Music, Drama, Drama, Mystery, Science...
## $ company1         <fct> Ghost House Pictures, Amercent Films, DreamWo...
## $ company2         <fct> North Box Productions, 20th Century Fox, Doub...
## $ company3         <fct> Walt Disney Pictures, Jinks/Cohen Company, Je...

Summary table of additional data:

Variables Type Description
imdb_id character identified number in imdb system, extract from Kaggle’s dataset
new_budget integer the budget number downloaded from TMDb
vote_count integer the number of vote given by users after movie was released
vote_average double the average vote given by users after movie was released
new_popularity double the popularity number downloaded from TMDb
new_revenue integer the revenue downloaded from TMDb
new_runtime integer runtime number downloaded from TMDb
new_release_date date/time release date downloaded from TMDb
genres1 character first genre of a movie
genres2 character second genre of a movie
genres3 character third genre of a movie
company1 character first company of a movie
company2 character second company of a movie
company3 character third company of a movie

Wikipedia

The second additional data was scrapped from wikipedia, using rvest package. Because of long processing time to scrapping and pre-processing, I didn’t copied those scrapping code in my report. If you’re interested with this code, you can refer it and/or download the completed data from my github repo. The code to download and import the additional data was as following:

Import data

Overview data

## Observations: 1,842
## Variables: 5
## $ imdb_id       <fct> tt3950078, tt0099581, tt1440292, tt0456020, tt12...
## $ wiki_budget   <fct> "US$35 Million", "US$35 Million", "US$35 Million...
## $ new_budget    <dbl> 35, 35, 35, 35, 35, 35, 25000, 25000, 25000, 250...
## $ currency_unit <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ final_budget  <dbl> 3.5e+07, 3.5e+07, 3.5e+07, 3.5e+07, 3.5e+07, 3.5...

Summary table of wikipedia dataset:

Variables Type Description
imdb_id character the identified number in imdb system
wiki_budget character the original scrapping from wikipedia
new_budget integer extracted number from wiki_budget (unit probably single,
thousand, million usd/jpy/cad/…)
currency_unit integer currency exchange number from other currency to usd
final_budget double final budget after transforming to usd

Data overview summary

In summary, top 3 problems were recognized in this section and the respective solution to deal with those problems were as following:

No Problem Solution
1 missing/NA values in many use new budget from additional data (TMDb & wikipedia)
variables, especial in to update/replace for missing/wrong value
budget
2 complicated text structure use text mining function in package stringr to replace
in character variables unnecessary characters and/or extract the required
(belongs_to_collection, data for analysis
genres, crew, cast,
production_companies…)
3 inaccuracy data priority to use data from additional data (TMDb & wikipedia)
instead of original data from Kaggle competition

Analysis

Data exploratory analysis

Release date

The first noting that release data was as type character, then we have to convert it to date-and-time format.

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1969-01-01" "1995-09-15" "2006-02-19" "2005-01-15" "2012-08-31" 
##         Max.         NA's 
## "2068-12-30"          "1"

The second noting that the time frame was really wide, the soonest released date was 1969-01-01 and the latest was “2068-12-30”. In addition, the latest release_date “2068-12-30” was impossible. Moreover, there were total 357 days later than the time I was doing this report on August \(27^{th}\), 2019 and 1 NA value. Hence, it’s certainly not enough to assure the original release date was reliable or not.

For all these reasons, I used the release date in TMDb additional data set to replace the original release data from Kaggle data.

Revenue

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.000e+00 2.380e+06 1.681e+07 6.673e+07 6.892e+07 1.520e+09

First of all, we could see the minimum revenue was 1 and the maximum revenue was 1519557910. In a little more detail,, there were total 57 movies had revenue value less than 1000. Those abnormal revenue number’s caused by the currency unit was recorded in million USD while almost movies currency unit was recorded in single USD.

Hence, to assure all revenue number was up-to-date and its reliability in my project, in like manner to budget number, I used the revenue from TMDb data to replace the original revenue number in Kaggle data.

The revenue number was used to evaluate the final modeling performance was also extracted as following:

Budget

Certainly, budget would be the most important variable to predict revenue of a movie.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         1   5053316  17000000  31108016  40000000 380000000      2023

Firstly, we could see 2023 NA values, approximate 27.3 % of total observation. Then to deal with those NA values, I replaced them by the budget number from additional data, which were downloaded from The Movie Database and Wikipedia.

Secondly, the minimum budget is just 1, this is the same issue and caused by the same reason similar to abnormal revenue number in previous section. In addition, some movies have budget recorded in million USD (value less than $1,000), but its revenue recorded in single USD (value up to $1,000,000). In contrast, some movies have budget record in single USD (value up to $1,000,000) but its revenue recorded in million USD (value less than $1,000).

On the whole, I summarized all issues related to budget number as following table.

budget_status revenue_status count percent remarks
low budget low revenue 13 0.4
low budget normal revenue 5 0.2 wrong budget
missing budget low revenue 34 1.1 NA impact
missing budget normal revenue 778 25.9 NA impact
normal budget low revenue 10 0.3 wrong revenue
normal budget normal revenue 2160 72.0

Hence, to prevent those mistakes impact to the performance of final predicting model, I replaced them by the value from additional data (TMDb & Wikipedia).

New budget

Firstly, I replaced abnormal budget numbers (value less than $1,000) if its respective revenue was normal (value greater than $10,000). Then, I replace the NA value. This step dropped NA value from 27.07% to 12%.

## Warning: Column `imdb_id` joining factors with different levels, coercing
## to character vector
## [1] 0.1197621

The NA value was reduced from 27.07% to 12% . Because the Wikipedia budget number is not available for every movies, therefore I replaced remain NA value by median budget value. The median budget values were calculated by each released year to minimize impact of time.

## Warning in df$new_budget[is.na(df$new_budget)] <- df$median_budget: number
## of items to replace is not a multiple of replacement length
## [1] 0

Runtime

The missing values in runtime was to be replaced by median run time value.

Now we can take the first look on the corrected value of budget, revenue, popularity and runtime.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We could see:

  • Not many movies have high revenue. 75.1 % movies in the training dataset have revenue lower than the average revenue. Only 24.9 % movies have revenue greater than the average revenue.

  • Budget, revenue, popularity seem to have positive skew distribution.

Profit ratio & Return on Investment (ROI)

Return on investment (ROI) is a ratio between net profit and cost of investment. A high ROI means the investment’s gains compare favorably to its cost. As a performance measure, ROI is used to evaluate the efficiency of an investment or to compare the efficiencies of several different investments.

## [1] 2.599791

In average, the profit ratio of a movie was 260%.

To evaluate the ROI of different movies, following was the summarized table of top 10 highest ROI movies:

title budget revenue ROI popularity
Paranormal Activity 15000 193355800 12889 12.7
The Blair Witch Project 60000 248000000 4132 14.8
The Tiger: An Old Hunter’s Tale 5000 11083449 2216 3.4
Madea Goes to Jail 80000 90508336 1130 2.8
Fuck You Goethe 2 121000 83027924 685 9.2
Pink Flamingos 12000 6000000 499 5.7
Gunfight at the O.K. Corral 25000 11750000 469 10.5
We Are from the Future 2 30000 8910819 296 1.2
Beaches 200000 57041866 284 16.2
The Legend of Boggy Creek 100000 22000000 219 1.5

The most profitable movie was “Paranormal Activity”, which was a 2007 American supernatural horror film co-produced, written, directed, photographed and edited by Oren Peli. It centerred on a young couple (Katie Featherston and Micah Sloat) who were haunted by a supernatural presence in their home. The film had earned nearly $108 million at the U.S. box office and a further $85 million internationally for a worldwide total of $193 million [8].

## [1] 1231

The second most profitable movie is “The Blair Witch Project”. Reference from wikipedia, The Blair Witch Project grossed nearly $250 million worldwide on a modest budget of $60,000, making it one of the most successful independent films of all time [9].

## [1] 1680

Not too many movies could be successful like both movies. To drop out its impact to final predicting model, I noted its row indexes in a vector named outlier_rows.

Correlation between revenue and budget, popularity, runtime

To evaluate the correlation and interaction between revenue and budget, popularity, runtime, I added new features includes:

  • budget_pop represent for interaction between budget and popularity.

    • benefit represent for the ratio between revenue and budget.

    • profit represent for the margin between revenue and budget.

We could see:

  • revenue was highest correlated to budget_pop (0.78), and new_budget (0.74).

    • it’s surprise that revenue/budget ratio seem not to correlate to any variables.

    • runtime seem not to correlate to revenue and other variables.

The following figures used to visualize the correlation between revenue vs budget, popularity:

## Warning: Removed 13194 rows containing non-finite values (stat_smooth).

We could see:

  • The scatter between budget & revenue was more diffuse than the scatter between budget_pop & revenue.

    • One popularity outlier (value > 100) break the smooth fitting line. The row index of this outlier could be determined by following code:
## [1] 2859

Adding this outlier to vector outlier_rows.

Change across time

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1915-02-08" "1992-10-26" "2004-07-15" "2000-02-25" "2011-06-17" 
##         Max. 
## "2017-08-11"

The soonest release date was “1915-02-08” and the latest release date was “2017-08-11”, the period of time were 97 years. Consumer behaviours have changed over the years: the MeToo movement, as well as other social developments, have surfaced in our society, and that would impact on the movies box office industry as well.

We could see:

  • The movies box office industry growth slowly before decade 1980s , after this time it’s continuous growing every year. 25% movies released in total 72 years from 1921 to 1993, but the next 25% movies released in only 8 years from 1993 to 2001.

    • 50% remain movies released in 17 after 2001.

We could see, in average:

  • 44.2 movies released on Friday, 18.9 movies released on Thursday, 17.1 movies released on Wednesday, remain movies released on other weekdays.

    • Movies released on Wednesday have highest revenue compared to other weekdays.

belongs_to_collection

This variable have greatest number of NA values. There are 79.98 % NA values and 20.02 % not-NA values. The NA values represented for a movie not belongs to any collection, otherwise a movie collection represent by following text:

[{‘id’: 313576, ‘name’: ‘\(\color{red}{\text{Hot Tub Time Machine Collection}}\)’, ‘poster_path’: ‘/iEhb00TGPucF0b4joM1ieyY026U.jpg’, ‘backdrop_path’: ‘/noeTVcgpBiD48fDjFVic1Vz7ope.jpg’}]

For each collection, I remove all unnecessary characters and kept only the collection’s name (\(\color{red}{\text{red text}}\)). The process performed by following code:

There were total 725 collection from both train_set and test_set. To visualize the difference between movies belong to a collection and movies not belong to any collection, I added one new feature collection_status to classify a movie belong to a collection or not. The following code and figure represented for this difference.

## Warning: Removed 4398 rows containing non-finite values (stat_boxplot).

We could see:

In average, a movie belongs to a collection has higher median revenue than other movies not belong to any collection.

genres

Let’s take a look on the genres.

## [1] "[{'id': 35, 'name': 'Comedy'}]"                                                                                                
## [2] "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]"

Some movies have one genre, other movies could had more than one genre. Each genre represent by. Each genre represents by following text “{‘id’: 35, ‘name’: ‘\(\color{red}{\text{Comedy}}\)’}” and separate by comma. For each genres value, I remove all unnecessary character and kept only the genre’s name (\(\color{red}{\text{red text}}\)).

Number of genres

To explore the correlation between number of genres vs movie revenue, I created the new variable to visualize the total genres of each movie.

Now we evaluate the correlation between budget, revenue on the number of genres.

## Warning: Removed 1 rows containing missing values (position_stack).

We could see:

  • Almost movies have two or three genres.

    • In average, movies which number of genres were from 3 to 6 seem to have higher average revenue compared to other movies which less than 3 genres or greater than 6 genres.

NA value

There was also 23 NA value. I replaced those NA values by “no genre”.

First genre

It is my impression that the first genre would be the main genre of each movie. To evaluate the difference between each genre, I extracted the first genre of each movie and add them to a new variable named main_genre.

Now figure out the difference of each main genre, by the number of movies and revenue.

We could see:

  • Top three common genres with highest number of movies were Drama, Comedy, Action.

    • Top three genres with greatest average revenue were Adventure, Animation, Science Fiction, and the average revenue of each genre are quite different compared to others.

production_companies

From previous section, we known there were total
414 NA values, approximate 5.6 % of total observation. Let’s take a quick look on the top rows of this variable.

## [1] "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
## [2] "[{'name': 'Walt Disney Pictures', 'id': 2}]"                                                                                       
## [3] "[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"

Some movies have one production company, other movies could had more than one production company. Each company represent by a name (string), and its id (number), for example {‘name’: ‘\(\color{red}{\text{Paramount Pictures}}\)’, ‘id’: 4}. For each production company, I remove all unnecessary character and kept only the production company’s name (\(\color{red}{\text{red text}}\)). For 414 NA values, I replaced by “no production companies info”.

Number of production companies

To explore the effect of number of production companies on the revenue, I created a new feature named number_of_company represent for the number of production companies of each movie. For NA value, I replaced by the median value. The process performed by following code:

Figure out the number of production company per movie and how it correlated to the revenue.

We could see:

90.5 % movies have less than or equal 5 production companies. The revenue seem to decrease when the number of production company greater than 5.

Top production companies effect

In the same manner, to evaluate the effect of top production companies on the revenue, I suppose that the first company would be the main company, who might have biggest impact on the production as well as quality of a movie. Then I created a summary data group by each production company to rank and evaluate their performance. The metric to evaluate performance of a production company were number of released movies, average budget, average revenue, ROI per company…

This process performed as following. The first step was to replaced all unnecessary characters and kept only the production company’s name in a new variables named companies.

Extract the first company and store into variables named first_company.

The summary data of production companies was created by following code:

Following was the summary table of top 10 production companies and their performance metrics.

first_company movies_per_company avg_budget_per_company avg_revenue_per_company ROI_per_company
no production companies info 414 8897759 3715800 -0.582
Universal Pictures 401 34826952 106734116 2.065
Paramount Pictures 389 40354231 117855829 1.921
Twentieth Century Fox Film Corporation 291 33140639 99184949 1.993
Columbia Pictures 235 51558467 117958388 1.288
New Line Cinema 187 30698316 102169356 2.328
Warner Bros.  162 24934157 74005740 1.968
Walt Disney Pictures 146 83791171 308776496 2.685
Metro-Goldwyn-Mayer (MGM) 109 12042941 17718473 0.471
Columbia Pictures Corporation 105 27680048 66261848 1.394

We could see:

Although there were total 3000 movies in train data set and 1064 production companies in variables first company, but:

  • Top 10 companies appeared as the first company in 826 movies, approximate 27.5 %.

  • only “Universal Pictures”, “Paramount Pictures” and “Walt Disney Pictures” contribute approximate 27% of total revenue.

  • 2.1% movies which were missing Production company information, had negative ROI (-0.547).

production_countries

## [1] "[{'iso_3166_1': 'US', 'name': 'United States of America'}]"

Similarly, some movies had one production country, while other movies had more than one production country. Each company represent by following text “{‘iso_3166_1’: ‘\(\color{red}{\text{US}}\)’, ‘name’: ‘United States of America’}” and separate by comma. For each production company, I remove all unnecessary character and kept only the production company’s name (red text). For NA values, I replaced by “no country info”.

The process performed by following code:

We could see, in average:

  • 60.7% movies produced in United State of America.

    • Movies from countries US, GB, CA, DE, AU, JP, CN have better revenue than movies from other countries.

    • Movies with no country information have lower revenue than other movies.

To add and evaluate effect of production country later on predicting model, a vector named top_countries to be created as following:

Features engineering

Following is the summary table of what insighs gained from previous section and its respective possible engineering process.

No Factors Insights Possible engineering process
1 Time-effect Movies box office industry growth Add new feature to
after that. minimize time effect.
Almost movies released on Friday, Add time features to
movies released on Wednesday had predicting model.
higher average revenue.
2 Budget, revenue, Budget, revenue, popularity have Logarithm transformation
popularity positive skew distribution.
Revenue is highest correlated to Add new feature represent for
budget x popularity & budget for interaction between budget
and popularity
3 Average profit ratio In average, a movie have profit Multiply budget by profit ratio
ratio 2.5997906 to calculat e predicted revenue.
4 Belong_to_collection Movies belongs to a collection Add collection status to
seem to have higher revenue. predicting model.
5 Genres Different genres have significant Add new feature to represent
different revenue. for all genres of a movie.
A movie could have more than one
genre.
Movies which number of genres Add features number of genres
were from 3 to 6 seem to have to predicting model.
higher average revenue.
6 Production companies Top 10 production companies Add features to recognize a
contribute 27% total release movies movie belong to a top production
Top production companies dominate or missing production company info.
company, the revenue percentage.
A movie missing production
company info have negative ROI.
7 Production countries Almost movie produced from US. Add features to recognize a
Movies from top countries have movie belong to a top production
better revenue than others. country, or missing production
Movies missing production country info.
countries info have lower revenue

Time-effect: Normallized popularity

From previous section, we known the period of time from first movie released date to the latest movie released date was 97 years. Furthermore, the popularity was calculated based on user interactions on the TMDb website, while the internet users and their time in internet increased by the time. To reduce the change across time, I added one more feature normalized_popularity by dividing popularity by the average popularity per release year. The calculation was as following:

Adding second time normallized popularity feature.

Budget and popularity interaction

New features named budget_norm_pop represent for the interaction between budget and normalized popularity to be added by following code:

Expected revenue based on average profit ratio

The expected revenue based on average profit ratio was calculated by following code:

Genres

To add the effect by genres, I created 19 new features named by each genre. Each feature had two levels 0 and 1, which:

  • 0: if this genre didn’t present in the genres list of movie

    • 1: if this genre present in the genres list of movie

Top production companies

To add the effect by top 10 production companies, I created 10 new features named by each production company. Each feature had two levels 0 and 1, which:

  • 0: if the production company didn’t present in the production companies list of movie

    • 1: if the production company present in the production companies list of movie

For other production companies did not belong to top 10 production companies, I assigned to a variables named “other_production_company”, with similarly two levels 0 and 1.

Top production countries

To add the effect by top production country, I created new features named by each production country. Each feature had two levels 0 and 1, which:

  • 0: if the production country didn’t present in the production countries list of movie

    • 1: if the production country present in the production countries list of movie

For other production countries did not belong to top production countries, I assigned to a variables named “other_production_countries”, with similarly two levels 0 and 1.

Modeling

Design of experiment

The experiments were designed to evaluate (i) the effect of different variables, (ii) different machine learning methods and the best performance model, (iii) effect of the logarit transformation on modeling performance. The RMSLE was used to evaluate modeling performance. Following was summarized table of experimented models in this section.

Exp# Model name Description Objective
1 Naive model use average revenue as predicted the RMSLE of this model to be used as
revenue. the baseline to evaluate performance of
other experiments.
2 Best ML methods compare different machine to select the machine learning method
learning method performance. The with best performance.
methods to be experimented were
random forest (rf), bayesian
generalized linear model (bayesglm),
generalized linear model boosting
(glmboost), linear model (lm).
3 Features selection use multiple features to predict to select features for final predicting
revenue. model.
4 Logarithm transform use logarit to transform the data. to evaluate the effect of logarithm on
machine learning performance.

Experiment 2: Best ML methods

In this experiment, I compared performance of random forest (rf), bayesian generalized linear model (bayesglm), generalized linear model boosting (glmboosting), linear model (lm). At the end of this experiment, we would determine what method have best performance.

The predicted revenues were calculated as following:

Create a data frame to contain all predicted revenue from different machine learning method.

Random forest performed better results than other methods and previous models. The RMSLE of random forest was 3.5026872. Therefore I used random forest for remain experiments.

Following was the RMSLE table of experimented methods:

Exp_No Experiment rf bayesglm glmboost lm
2 Evaluate machine learning methods 2.045 2.399 2.424 2.475

Experiment 3: Features selection

Objective of this experiment was to evaluate the importance and select the features for predicting model.

Many models that can be accessed using caret’s train function produce prediction equations that do not necessarily use all the predictors. In many cases, using these models with built-in feature selection will be more efficient than algorithms where the search routine for the right predictors is external to the model. Built-in feature selection typically couples the predictor search algorithm with the parameter estimation and are usually optimized with a single objective function (e.g. error rates or likelihood). [3]

Feature Selection using Univariate Filters

An approach to feature selection is to pre-screen the predictors using simple univariate statistical methods then only use those that pass some criterion in the subsequent model steps. Similar to recursive selection, cross-validation of the subsequent models will be biased as the remaining predictors have already been evaluate on the data set. Proper performance estimates via resampling should include the feature selection step. [4]

33 variables were selected includes:

##  [1] "expected_revenue"                      
##  [2] "revenue_pop"                           
##  [3] "revenue_norm_pop"                      
##  [4] "new_budget"                            
##  [5] "budget_pop"                            
##  [6] "budget_norm_pop"                       
##  [7] "new_popularity"                        
##  [8] "normallized_popularity"                
##  [9] "collection_status"                     
## [10] "number_of_company"                     
## [11] "number_genres"                         
## [12] "Action"                                
## [13] "Adventure"                             
## [14] "Animation"                             
## [15] "Documentary"                           
## [16] "Drama"                                 
## [17] "Fantasy"                               
## [18] "Foreign"                               
## [19] "History"                               
## [20] "Romance"                               
## [21] "Science Fiction"                       
## [22] "no production companies info"          
## [23] "Universal Pictures"                    
## [24] "Paramount Pictures"                    
## [25] "Twentieth Century Fox Film Corporation"
## [26] "Columbia Pictures"                     
## [27] "New Line Cinema"                       
## [28] "Warner Bros."                          
## [29] "Walt Disney Pictures"                  
## [30] "other_production_company"              
## [31] "US"                                    
## [32] "other_production_countries"            
## [33] "release_year"

The predicted revenue was calculated as following.

The RMSLE was 2.0527818 , which was significant improvement on the performance.

Confounding between variables

In statistics, a confounder (also confounding variable, confounding factor, or lurking variable) is a variable that influences both the dependent variable and independent variable, causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations. [5]

In previous section, I added the interaction features between revenue, budget, popularity… The following correlation matrix showed the correlation between features related to budget, revenue, popularity…

We could see high correlation number between features represented for interaction related to revenue (revenue-group features includes expected_revenue, revenue_pop, revenue_norm_pop) & features represented for interaction related to budget (budget-group features includes new_budget, budget_pop, budget_norm_pop).

To explore the effect of include/exclude above features (revenue-group features and budget-group features), I evaluated performance of three following models:

Model_No Machine Learning method Feature selection Preprocessing
1 Random Forest (rf) using Univariate Filters
2 Random Forest (rf) using revenue-group features exclude new_budget, budget_pop,
budget_norm_pop
3 Random Forest (rf) using budget-group features exclude expected_revenue,
revenue_pop, revenue_norm_pop

The experiment process performed by following code:

The results were summarized as following table:

Univariate_Filters all_features Revenue_group_Features Budget_group_Features
2.053 2.051 2.052 2.045

We could see the lowest RMSLE in model - Feature Selection using Univariate Filters, but the RMSLE gap between all models were not quite different.

Experiment 4: Logarithm transformation

One insight from previous section that the budget and revenue had a positive skewed distribution shape. However, many statistic methods have been developed to test the normality assumption of observed data. When the distribution of the continuous data is non-normal, transformations of data are applied to make the data as “normal” as possible and, thus, increase the validity of the associated statistical analyses. The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality. [3]

Therefore, the objective of this experiment is to evaluate the effect of logarit transformation on the results.

Before process modeling, I transformed the revenue, budget and its respective features using logarit function logb.

We could see the distribution shape changed before and after log transformation.

The machine learning method using in this experiment was Random Forest, using Univariate Filter.

The predicted revenue was calculated as following:

Exp_No Experiment rf
4 Logarit transformation 1.666

We could see the RMSLE was 1.6655832 , which was a significant improvement compare to other models in previous experiments.

Experiments summary

The experiment results was summarized as following:

  • Random forest was performed better than bayesian generalized linear model (bayesglm), generalized linear model boosting (glmboost), linear model (lm). Therefore, I selected random forest as the machine learning method for final models.

    • Model using Univariate Filters to select features performed better model using revenue-group features and budget-group features, but the gap was not significant different.

    • Models using logarit transformation had better results (lower RMSLE) than other non-logarit transformation models in previous experiments.

Experiments Summary
Exp_No Experiment normal_calculation rf bayesglm glmboost lm
1 Naive model 3.097
2 Evaluate machine learning methods
2.045 2.399 2.424 2.475
3 Feature selection
2.053
4 Logarit transformation
1.666

Visualize the final results

Results & discussion

Final models for validation

As a consequence, I trained 3 models for final validation in test data. The machine learning method, feature selection, pre-processing were summarized as following:

Model_No Machine Learning method Feature selection Preprocessing
1 Random Forest (rf) using Univariate Filters logb transforming
2 Random Forest (rf) using revenue-group features logb transforming
3 Random Forest (rf) using budget-group features logb transforming

Pre-processing data using logarith transformation method.

Model 1: Feature Selection using Univariate Filters

Train model:

Final predicted revenue:

Model 2: Revenue-group features

Train model:

Final predicted revenue:

Model 3: Budget-group features

Train model:

Final predicted revenue:

Discussion on modeling performance

In previous experiments, the RMSLE from three models were not quite different. Nevertheless, in the final validation, the best performance results was received from the model using revenue-group features, with RMSLE = 2.3446715 . Following was the model using Univariate Filters for feature selection, with RMSLE = 2.4292049. The worst performance was received by model using Budget-group features, with RMSLE = 2.5831699.

On the other hand, the best RMSLE on validated on test data was 2.345 , which was higher than the RMSLE receive from experiment 3, RMSLE = 1.666. It might cause of the unbalancing data between train and test data. Certainly, the train data have 3000 rows (approx ~40% total observations) and test data have 4398 observations (approx ~60% total observations). To be more precise, approx 65% movie’s collection appeared in both train and test data. This issue might lead to an inaccuracy predicted results for a movie with unseen predictors as well.

## Warning in na.omit(df$collection[1:3000]) ==
## na.omit(df$collection[3001:7398]): longer object length is not a multiple
## of shorter object length
## [1] 0.6507503

To deal with imbalance data issue, one approach is to collect more data, which are available in The Movie Database. Furthermore, in this model I didn’t analyze and include remain features such as cast, crew, Keywords… which might effect on the movie revenue.

What’s most important features?

The important features plot are as following:

The top 3 important features were:

  • revenue_norm_pop, represents for the interaction between revenue and normalized popularity,

    • normallized_popularity, represents for the normalized popularity by release year,

    • new_popularity, represents for the popularity of a movies.

Conclusion

In this report I described a way to build up the machine learning model to predict a movie revenue before its release date, using data from Kaggle TMDb Box Office Competition and other publicly data from The Movie Database website and Wikipedia. The original features, which were analyzed and included in final predicting model, were budget, popularity, genres, belong to collection, production companies, production countries. Four machine learning methods were experimented in a cross-validation dataset, Random Forest (rf) give better results (RMSLE 2.045) than Generalized Linear Model Boosting (glmboost) (RMSLE 2.424), Bayesian Generalized Linear Model (bayesglm) (RMSLE 2.399), Linear Model (lm) (RMSLE 2.475). Three approach for feature selection were experimented with no significant difference on modeling performance (Univariate Filter RMSLE 2.053 , revenue-group features RMSLE 2.052 , budget-group features RMSLE 2.045). Since revenue and budget had a skewed-distribution, a data pre-processing approach using log transformation (base e) on the model using Random Forest and Univariate Filters were experimented and give best performance (RMSLE 1.666) on cross validation dataset.

Finally, three models were developed and validated on the test data, with best performance (RMSLE 2.345) belong to model using revenue-group features. The most importance features on best performance model are revenue_norm_pop represents for the interaction between revenue and normalized popularity, normallized_popularity represents for the normalized popularity by release year, new_popularity represents for the popularity of a movies.

Because those models were developed on a 3000 observation train dataset and validated on a 4398 observation test dataset, it’s limited to provide an accuracy results for a movie with unseen predictors. However, this is also an opportunity to improve modeling performance in the future by adding more observation in the training set. Furthermore, other features were not analyzed and included in predicting model, which are also analyzed and added-in for further improvement.

Another point that although random forest give better performance on RMSLE than other machine learning methods, its processing time is quite longer and might be limited if the training dataset is more bigger. For this reason, alternative machine learning methods are able to experiment to improve modeling speed in the future.

Reference

[1] TMDB Box Office Prediction - https://www.kaggle.com/c/tmdb-box-office-prediction/overview

[2] TMDB Box Office Prediction - Discussion - https://www.kaggle.com/c/tmdb-box-office-prediction/discussion

[3] The caret Package - Max Kuhn - 2019-03-27 - Feature Selection Overview - https://topepo.github.io/caret/feature-selection-overview.html

[4] The caret Package - Max Kuhn - 2019-03-27 - Feature Selection using Univariate Filters - https://topepo.github.io/caret/feature-selection-using-univariate-filters.html

[5] Confounding - https://en.wikipedia.org/wiki/Confounding

[6] Log-transformation and its implications for data analysis - Changyong FENG,1,, Hongyue WANG,1 Naiji LU,1 Tian CHEN,1 Hua HE,1 Ying LU,2 and Xin M. TU1 - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/

[7] The Movie Database - https://www.themoviedb.org/

[8] https://en.wikipedia.org/wiki/Paranormal_Activity

[9] https://en.wikipedia.org/wiki/The_Blair_Witch_Project

[10] TMDb package - https://cran.r-project.org/web/packages/TMDb/index.html